Final Report

SYSTEM  STRUCTURE  FOR  FAULT-TOLERANT  PROCESSES 
IN  DISTRIBUTED  SYSTEMS 

Summary 

This  is  the  final  technical  report  on  the  RADC  Post-Doctoral  Program  Contract  F30602-81-C-0205, 
Task  #  B-6-3508  titled  System  Architecture  for  Fault-Tolerant  Processes  in  Distributed  Systems,  funded 
through  the  University  of  Kansas  Center  for  Research.  The  primary  thrust  of  this  project  has  been  on  the 
development  of  system  architectures  and  protocols  for  building  fault-tolerant  distributed  systems. 

During the course of this effort we addressed various aspects related to building fault-tolerant
distributed systems. The four basic functions required for building a fault-tolerant system are: error
detection, fault diagnosis, fault isolation/reconfiguration, and error recovery/restart. Error detection
involves detecting those states of the computation that do not satisfy the specifications. Fault-diagnosis
algorithms identify the malfunctioning components in the system. Reconfiguration techniques isolate the
faulty components from the rest of the system and possibly replace them with "healthy" units. Finally, the
error recovery techniques bring the system back to some consistent state before restarting the operations
again. In some systems a sufficient level of redundancy is provided to mask the effects of malfunctioning
components. Such systems continue to function correctly even in the presence of some limited number of
faulty components. In such systems one does not have to execute any fault diagnosis or error recovery
algorithms, provided that there exist only some limited number of faulty components in the system. Such
redundancy is called masking redundancy.

In this research effort we investigated algorithms and protocols in four different areas:
fault diagnosis, error recovery in systems with replication of processes or data, error recovery based on
self-stabilization, and use of masking redundancy in replicated systems using agreement protocols.

This final report is a collection of six technical reports presenting the results of the research
undertaken during the course of this contract. These reports are listed below:

(1) Construction of Resilient Actions Using Replication and Checkpointing in Distributed Systems, (A.
Tripathi, S. Azadegan, and S. Ranka)

(2)  Constant  Expected  Time  Randomized  Byzantine  Agreement  Protocol  without  Shared  Secrets  and 
Cryptography,  (S.  Ranka,  A.  Tripathi,  and  S.  Azadegan) 

(3)  A  Protocol  for  Self-Stabilization  in  Binary  Trees,  (S.  Dong  and  A.  Tripathi) 

(4) An Improved Algorithm for Termination Detection in Asynchronous Distributed Computations, (A.
Tripathi and S. Dong)

(5)  Fault  Diagnosis  in  Distributed  Systems  —  Interim  Report,  (V.  Raghavan,  S.  Dong,  and  A.  Tripathi) 

(6)  Towards  an  Improved  Diagnosability  Algorithm,  (V.  Raghavan  and  A.  Tripathi) 

The first report describes a system architecture for building resilient processes using replication and
checkpointing. It describes protocols for managing a replicated object and the processes executing the
actions invoked on the object. In our design one of the copies of a replicated object acts as the primary
copy for executing a requested action. A process is created to execute a requested action, and periodically
the execution state of the process is checkpointed with the other copies of the object, which act as the
backup copies. In our design we have adopted an object-oriented view of distributed computing.
Nevertheless, the protocols developed here are equally applicable to conventional process-oriented models
of computing. In designing these protocols we first build a facility to detect whether a site in the system
is up or down. The protocols for checkpointing and recovery make use of this information. We assume that
site failures are fail-stop. The process replication protocols function correctly even in the presence of
network partitioning, in which case every partition that has at least one available copy of the object will
continue to function correctly. We have not addressed the problems related to merging or reconciling the
copies when a partition is repaired.

The second report presents an agreement protocol that provides a consistent view of the computation
to each correctly functioning copy of a replicated process. This protocol provides a mechanism to
implement masking redundancy in the system. The algorithm developed here is probabilistic and does not
require any shared secrets or cryptography.

The third report presents a protocol for self-stabilization in binary trees. The term self-stabilization
means that if at any time the system is in some inconsistent state, then the normal execution of each
process eventually brings the system back to some consistent state. Such systems do not require any special
error handling protocols. The protocols for normal operations are sufficient to guarantee recovery from an
erroneous state.

The fourth report presents a protocol for detecting the termination of a set of cooperating
communicating processes. The interprocess communication is based on asynchronous message passing.

The last two reports address the problems related to fault diagnosis in interconnected systems. The
first of these two reports presents a survey of various fault-diagnosis algorithms based on the model
proposed by Preparata, Metze, and Chien (PMC model). The last report presents some of the new results that
we have obtained in the direction of designing more efficient fault-diagnosis algorithms.

Construction of Resilient Actions in Distributed Systems using
Replication and Checkpointing


Anand  Tripathi,  Shiva  Azadegan  and  Sanjay  Ranka 
Department  of  Computer  Science 
University  of  Minnesota 
Minneapolis,  MN  55455 


1.  Introduction 

This report addresses the problem of building resilient actions and processes by replicating them, so
that they have a higher probability of surviving site crashes and completing successfully than their
non-replicated implementations. The ability of a system to continue its operations in spite of failures of
some of its components is called the resiliency of the system. In this report we present a set of protocols
for constructing such resilient operations in object-oriented environments by replicating the objects at
multiple sites in a distributed system. We have used an object-oriented model of distributed computing;
nevertheless, these protocols are equally applicable to the conventional, process-oriented view of
computation. The protocols described in this report facilitate checkpointing during executions of actions
of a replicated object and restarting of an action at a different copy of the object when its primary
execution copy crashes.

The primary contributions of this report are an interrupt-driven structure of the error recovery
protocols in replication management and a system architecture which primarily uses the unreliable datagram
facilities for normal operations and makes limited use of reliable message transmission protocols for
transmitting exception conditions in the network during error recovery. We also present a novel protocol
for status monitoring in the system using unreliable datagrams. One of the features of our design is that we
do not require sites to have access to secondary storage for recovery. This makes it ideal for applications
where providing secondary storage is infeasible. Nevertheless, availability of secondary storage at certain
sites will enhance the reliability of the system.

We assume that the distributed system environment underlying our protocols consists of fail-stop
processors communicating across a communication network. The fail-stop assumption implies that a
malfunctioning processor simply fails by crashing and does not behave maliciously or act as an adversary to
the system. Thus the need for any Byzantine Agreement protocol [11] is eliminated in our design. The
communication network provides a datagram facility which transports messages on a "best effort" basis, i.e.,
the delivery of messages is not guaranteed. We do not require the host sites to have access to secondary
storage devices. The communication network is assumed to introduce delivery delays or loss of messages.
In order to keep the primary focus of this report on the management of primary/secondary copies and
recovery when the primary copy fails, we will assume that the communication network never gets
partitioned. In the absence of this assumption we would have to incorporate some network partition repair
protocols [6] in our designs.

This work was supported by Rome Air Development Center under the Post-Doctoral Fellowship Program (Grant
No. F-30602-81-C-0205).

In the past, several systems have been designed to support replication of objects or processes. Some
of the most noteworthy of these designs are Tandem’s Guardian operating system [1], ISIS [3], FTMP [9],
and SIFT [13]. We have adopted several concepts from the designs of ISIS and Tandem’s Guardian
operating system. SIFT supports replication of processes for real-time applications and recovery from
arbitrary failures; it uses Byzantine agreement protocols for inter-process communication between the
replicated copies. ISIS uses a variety of reliable broadcast primitives [4] for replication management. In
Guardian, replication is limited to pairs only. We assume a reliable broadcast protocol, as described in
[5], only for communicating signals (exception conditions) during the recovery phases; the more expensive
reliable broadcast primitives are thus confined to recovery.

In an object-oriented environment each object comprising the system encapsulates some local data
and a set of actions to manipulate the data. We might require that some of these objects, which might be of
critical importance to the reliability and availability of the system, have higher resilience to errors than
the other objects. This can be achieved by replicating these objects at different sites. We use the concept
of a k-resilient object [3], which is guaranteed to remain operational despite up to k site failures, and
apply it in the context of resilient actions. In this report, we describe a scheme for establishing
distributed checkpoints and recovering from site crashes that can be used to implement k-resilient actions.

The next section describes an abstract view of resilient actions (and objects) in our system. Section 3
outlines the requirements for achieving this functionality in replicated environments. Section 4 presents
the functional layers in our design and outlines its important assumptions. Section 5 introduces the
notation used for the protocol descriptions. Section 6 describes the protocols for normal
primary/secondary copy operation. Section 7 contains the protocols for error recovery and the arguments
for their correctness.

2.  An  Abstract  View  of  Resilient  Objects  and  Actions 

Before discussing the issues involved in replicating an object and its actions, this section describes
the logical view of a resilient object and its associated actions. Our design achieves this abstract view of
an object by replicating it. Each object in an object-oriented environment encapsulates some data and a set
of actions. To perform any execution on a particular object, the object is called to execute one of its
actions. A result is returned after the action is completed. This action may be viewed as a sequence of
synchronous operations (a_1, a_2, ..., a_n). Each operation is a computation on one of the following types
of data: (1) local data of the object, (2) temporary data of the object, (3) data of another object (in the
case of a nested action). We will use the term "environment of an action" to refer to the temporary data
maintained by the action and the state variables of the object to which that action belongs.

In the abstract view, each object has access to some k-resilient storage¹ which survives crashes of up
to k different processors or storage units in the system. This k-resilient storage is used for saving
checkpoints during execution of an action. The following discussion presents this abstract view of a
resilient object using an example. During normal operations, depicted in Figure 1, when an object A is
called to perform Action 1, it will do a set of operations (1 to 6 in this example). In performing this
invocation it will call objects B and C to perform some actions. If the invocation is successfully executed,
then the object will send some result to the object it was called from, i.e. the client object. However, in
case of failure, object A may temporarily halt its execution. When the site comes up, based on the log
information maintained on the k-resilient storage, it may restart its execution from some previous
operation, i.e. it may re-execute some operations. In order to cope with the consistency problems which may
arise due to the re-execution of operations, we require that any action which may be re-executed on an
object after a failure be idempotent, so that when the same action is performed again on an object it
returns the same value,² without changing its local data. This can be achieved in two ways. The first way is
to undo all actions performed since the last execution of the action, and then execute the action again.
Performing the action again then returns the same value and has the same effects on the local data as
before. The second way is to retain the results of the action performed on the object long enough, so that
if a previously executed action on the object is asked to be performed again, the retained value can be sent
to the client object. This preserves idempotency.

Figure 1: Abstract view of Resilient Objects
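
The second approach, retaining results so that a repeated request is answered without re-running the
action, can be sketched as follows. This is only a minimal illustration in Python, not the report's
protocol: the class name RetainedResults, the method invoke, and the call-id string "call-1" are all
hypothetical, and the sketch ignores how long results must be retained.

```python
from typing import Any, Callable, Dict


class RetainedResults:
    """Hypothetical per-object table mapping a call-id to its retained response."""

    def __init__(self) -> None:
        self._responses: Dict[str, Any] = {}

    def invoke(self, call_id: str, action: Callable[[], Any]) -> Any:
        # A repeated call-id means the action already ran: return the retained
        # value instead of executing the action (and its side effects) again.
        if call_id in self._responses:
            return self._responses[call_id]
        result = action()
        self._responses[call_id] = result
        return result


# Local data of an example object, modified by the action below.
counter = {"value": 0}


def increment() -> int:
    counter["value"] += 1
    return counter["value"]


table = RetainedResults()
first = table.invoke("call-1", increment)
again = table.invoke("call-1", increment)  # same call re-issued after a restart
```

The second invocation returns the retained value and leaves the local data untouched, which is the
behaviour the idempotency requirement asks for.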

Let us now consider a recursive call, as shown in Figure 2, on object A. Object A executes an action
which results in a chain of invocations on other objects (the length of which is nil in case A calls itself
directly), the last of which calls A again. In such a case, the state of the local data of object A, which
may be modified, should be the state in which the last call was made from object A. This state can be easily
identified by looking at the call-id of the recursive call. In our design, a call-id for any action invoked
by an object is obtained by concatenating the call-id of the action this object was invoked with, its object
UID, action-id and operation number. The call-id for the top-most level action is a globally unique
identifier. In this example object A calls another object B in operation 4 of Action 1, object B calls
object C, and object C in turn calls object A. Now the state of object A which this invocation of an
operation on A by C should see is the state in which object A was left when operation 4 was performed.

¹ Such a storage can be implemented using either some disk storage devices or the volatile storage of some
other sites in the system.

² One should note here that such a model will not be applicable in systems where the results of some
invocations are time-dependent.
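
The call-id construction described above can be illustrated with a short Python sketch. The separators
"/" and ":", the identifier "G42", and the action and operation numbers used for B and C are assumptions
made for the example; the report specifies only that the parent call-id, object UID, action-id and
operation number are concatenated.

```python
from typing import Optional, Tuple


def child_call_id(parent_call_id: str, object_uid: str,
                  action_id: int, operation_no: int) -> str:
    """Concatenate the caller's own call-id with its UID, action-id and operation number."""
    return f"{parent_call_id}/{object_uid}:{action_id}:{operation_no}"


def last_call_from(call_id: str, object_uid: str) -> Optional[Tuple[int, int]]:
    """Return (action_id, operation_no) of the most recent call made by object_uid, if any."""
    # The first component is the globally unique top-level identifier, so skip it.
    for segment in reversed(call_id.split("/")[1:]):
        uid, action_id, operation_no = segment.split(":")
        if uid == object_uid:
            return int(action_id), int(operation_no)
    return None


top = "G42"                                  # hypothetical top-level call-id
a_to_b = child_call_id(top, "A", 1, 4)       # A calls B in operation 4 of Action 1
b_to_c = child_call_id(a_to_b, "B", 2, 1)    # B calls C
c_to_a = child_call_id(b_to_c, "C", 3, 1)    # C calls A again
```

Scanning the call-id of the recursive call for the latest segment belonging to A recovers the operation
(here operation 4 of Action 1) whose state the recursive invocation should see.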

In the above examples we described some characteristics of an action performed on an object. We
want these characteristics to be preserved when one or more of the objects is replicated, i.e. whether an
object on which an action is invoked is replicated or non-replicated should be transparent to the client
object. In the later sections, we describe a scheme for replicating objects and their actions while
preserving these characteristics.

3.  Replicated  Environment 

In a replicated distributed environment an object can continue execution of its actions despite site
failures. To achieve this objective, information about the execution must be distributed among different
processes as the execution proceeds so that another copy of the process, at another site, can take over and
continue the operation when the executing copy fails. Moreover, it is necessary that copies of an object be
consistent in performing further actions. This requirement is important for a consistent behavior of all
copies as a single object.

In a replicated object-oriented environment an action is performed, as in the non-replicated case, by
invoking an action on a replicated object. In fact, it is totally transparent to the invoker that the called
object is replicated. These actions are guaranteed to be executed atomically. The operating system
underlying our design supports locating objects (or some of the copies of a replicated object) and
delivering the invocation messages to them. In addition, these objects can be accessed concurrently, which
requires some concurrency control mechanism in order to preserve the object consistency. Many such
mechanisms have already been developed [2] and we will not discuss any of these here. Our scheme is
orthogonal to any specific concurrency control mechanism.

Figure 2: Recursive Call

When an action on an object is invoked, we must determine which copies of the object will start the
execution of the action. We refer to such copies as the primary copies for that invocation. We can either
have multiple copies of an object executing the same action concurrently or only a single copy executing
the action. We chose to have a single copy executing, as we believe that multiple concurrent executions of
an action waste system resources and do not increase the availability of the system in a very
significant way. In some situations one might need to execute more than one copy concurrently. In the
scheme which we describe here no consistency requirement will be violated even if we allow simultaneous
execution of multiple copies. Since we chose to have a single copy executing, we must determine which
copy should start the operation. There are two possible cases: either both the invoker and invoked object
have copies on the same site or they have copies on different sites. In the former case, it is logical to
choose the site for the primary copy as the one that hosts both objects. In the latter case, the site can be
chosen either dynamically, based on some criterion such as the CPU utilization of the site or the physical
distance of the site from the invoker, or statically, by having a default primary copy for each action. We
chose to have a default primary copy, which initiates the execution, when the invoker and invoked
objects do not have copies on the same site. Once the primary copy of an action starts the execution, it
periodically sends synchronous and asynchronous checkpoint information to all secondary copies. If the
primary copy of an action fails, then the recovery protocol is initiated to elect a new primary copy.
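
The site-selection rule chosen above can be written down in a few lines of Python. This is a sketch
under stated assumptions: the function name choose_primary and the site names are hypothetical, and the
tie-break by site name when several sites host both objects is an assumption, since the report does not
say how such ties are resolved.

```python
from typing import Iterable


def choose_primary(invoker_sites: Iterable[str],
                   object_sites: Iterable[str],
                   default_primary: str) -> str:
    """Pick the site whose copy of the invoked object starts executing the action."""
    # Case 1: some site hosts copies of both the invoker and the invoked object.
    shared = sorted(set(invoker_sites) & set(object_sites))
    if shared:
        return shared[0]  # tie-break by site name (an assumption of this sketch)
    # Case 2: no common site, so fall back to the statically chosen default primary.
    return default_primary


colocated = choose_primary(["s1", "s2"], ["s2", "s3"], default_primary="s3")
fallback = choose_primary(["s1"], ["s2", "s3"], default_primary="s3")
```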

4.  System  Architecture  for  Replication  Management  and  Recovery 

An abstract view of the system architecture is shown in Figure 3. The top-most layer represents the
processes executing the currently on-going actions of a replicated object by its copies at different sites.
Such a process may be either in the normal computation phase (i.e., executing the action as the primary or
secondary copy) or in the recovery phase (i.e., executing the election protocol). The primary focus of this
report is on the protocols executed by these processes. In our model each site has only one copy of the
object.

Underlying these processes, corresponding to each copy of the replicated object, there is an object
manager process which is responsible for scheduling these processes, enforcing necessary concurrency
control protocols, and interrupting these processes when some exception conditions (signals) arise.
Corresponding to every executing action at a copy of a replicated object, there is a process which executes
that action. We refer to this process as the primary/secondary copy of that action. The object manager also
coordinates with the object managers at other sites to maintain the current status of the replication
configuration, i.e. the status of the other copies as up or down. For this purpose each object manager
periodically executes the status monitoring protocol in which it broadcasts status information to other
object managers which are participating in the same action; we will denote this period by T. In this
protocol we also assume that the clocks in the network are synchronized using some network clock
synchronization algorithm [10]. The following properties of this status monitoring protocol, the details of
which are given in the Appendix, are of interest to us in this design. (1) If a copy crashes during the
interval [kT, (k+1)T], then other up copies will detect this failure by the time (k+‘i)T. (2) If a copy
restarts during the interval [kT, (k+1)T], then with a very high probability³ all other up copies will know
about this restart by the time (k+2)T. (3) If at time kT a copy incorrectly assumes some other copy to be
down, then there is very high probability that this incorrect view will get corrected by time (k+1)T.
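
The kind of up/down bookkeeping that such a status monitoring protocol implies can be sketched in
Python. This is not the protocol given in the Appendix: the class StatusMonitor, its methods, and the
rule "no status message during one full period means down" are assumptions made for the illustration,
and the sketch makes no claim about the detection bounds stated in properties (1) to (3).

```python
from typing import Dict, Iterable, Set


class StatusMonitor:
    """Hypothetical round-based up/down view kept by one object manager."""

    def __init__(self, copies: Iterable[str]) -> None:
        self.status: Dict[str, str] = {c: "up" for c in copies}
        self._heard: Set[str] = set()

    def on_status_message(self, sender: str) -> None:
        # Called whenever a status datagram from `sender` arrives.
        self._heard.add(sender)

    def end_of_period(self) -> None:
        # Called once per period T: silence is read as "down", a message as "up".
        for copy in self.status:
            self.status[copy] = "up" if copy in self._heard else "down"
        self._heard.clear()

    def available(self) -> Set[str]:
        return {c for c, s in self.status.items() if s == "up"}


monitor = StatusMonitor(["b", "c"])
monitor.on_status_message("b")        # the datagram from c was lost, or c crashed
monitor.end_of_period()
view_after_loss = monitor.available()

monitor.on_status_message("b")
monitor.on_status_message("c")        # c is heard from again in the next period
monitor.end_of_period()
view_after_repair = monitor.available()
```

Because datagrams may be lost, a copy can be wrongly marked down for a period; the next status message
that does arrive corrects the view, which is the flavour of property (3).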

With every object manager some configuration-related data is maintained. This includes the list of the
unique identifiers (UIDs) of the other copies, the list of copies which are currently available, the default
primary copy, and the current operations being performed together with their primary copies (Figure 4).
Moreover, for every replicated object there is a static ordering of copies, called the election manager
list (EM list), which is used to designate one of the copies as an election manager to elect a new primary
copy for an on-going invocation in case of primary copy failure. The election manager elects one of the
available copies, preferably the one with the latest checkpoint, as the new primary copy.
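
A small Python sketch may help to picture how the EM list and the checkpoint information could be used.
Two points are assumptions of the sketch rather than statements from the report: that the election
manager is the first available copy in the static EM list order, and that ties between equally recent
checkpoints are broken by copy name. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def election_manager(em_list: List[str], available: Set[str]) -> Optional[str]:
    """Return the first copy in the static EM list that is currently up (assumed rule)."""
    for copy in em_list:
        if copy in available:
            return copy
    return None


def elect_primary(latest_chkpnt: Dict[str, int], available: Set[str]) -> Optional[str]:
    """Among available copies, prefer the one holding the latest checkpoint."""
    candidates = [c for c in latest_chkpnt if c in available]
    if not candidates:
        return None
    # Ties are broken by copy name; this tie-break is an assumption of the sketch.
    return max(candidates, key=lambda c: (latest_chkpnt[c], c))


up_copies = {"b", "c"}                               # copy "a", the old primary, is down
manager = election_manager(["a", "b", "c"], up_copies)
new_primary = elect_primary({"a": 5, "b": 3, "c": 4}, up_copies)
```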

Our protocols are interrupt-driven in the sense that the execution of a statement during the normal
computation phase or during the recovery phase may get interrupted if any of the specified exception
conditions arises. An exception condition is enabled if an exception handler is associated with the
statement being executed or any of its outer blocks. Reception of a signal, for an enabled exception
condition, at an object may cause interruption of some of its currently executing actions. We refer to
signals such as Election_Call, Election_Termination, and Election_Abort as global signals since they are
sent from one site to another site; in other words they are sent through the reliable broadcast system.
Local signals, such as EM_Failure and Primary_Copy_Failure, are those which are generated by the object
manager as the result of detecting some failures. The local signals are sent only to the local processes of
the object. In contrast to signals, the arrival of a message does not cause any interrupt at the destination
object.

³ This probability is a function of the reliability of the underlying datagram communication network.

Figure 3: System Layers of Abstraction

Whenever an object manager detects (by executing the status monitoring protocol) that some other copy
x has gone down, it updates its configuration data. It also checks whether x had been the primary copy of
some on-going actions, in which case it sends a signal called Primary_Copy_Failure to the local processes
executing those actions as secondary copies. This signal will cause the secondary copies to execute the
recovery protocol (election protocol). Similarly, if x was the election manager copy for some action, an
exception condition called EM_Failure is signaled to the local process, corresponding to that action,
executing the recovery protocol. When an election manager copy starts an election, it signals an exception
condition called Election_Call to interrupt all other copies to participate in the election. This signal
causes all processes (at those copies) executing as secondary copies to participate in electing a new
primary copy. Similarly, the successful completion of an election is signaled as a condition called
Election_Terminate. An unsuccessful completion of the election is signaled using Election_Abort.

The underlying communication system provides an unreliable, asynchronous message passing
primitive send(msg) to destination which sends the message msg to the object (or objects) specified in the
destination field. It also provides a primitive receive(msg) from source for receiving a message msg from
the object (or objects) specified in the source field. In this model receive is a blocking operation, i.e. a
call to this function returns if and only if the expected message(s) is (are) received, and in such a case
this function returns true. The source and destination are specified as the unique identifiers (UIDs) of the
objects. In case of replicated objects, a specific copy is specified by extending the unique identifier with
the ID of the host site of that copy, called Extended_Object_UID (assuming that only one copy resides at
any host site). The send primitive is implemented using the datagram facility which does not guarantee the
delivery of a message.


/* List of the copies of the replicated object, and their status */

Configuration : list of (copy : Extended_Object_UID;
                         status : (up, down));

Avail_copies : list of copies in the Configuration list with up status;

/* Static list of copies which can act as election managers */

EM_List : list of Object_UID;  /* object’s copies */

/* Current calls being executed on the object */

Current_Call_Table : list of (Call_id : String;
                              Action_id : Integer;
                              Initial_version : Memory_address;
                              Latest_chkpnt : Integer;
                              Current_version : Memory_address;
                              Primary_Copy : Extended_Object_UID);

/* Retained values of the different calls performed on the object */

Retained_values : list of (Call_id : String;
                           Response : Message_type);

Figure 4: Data Structures Maintained with an Object
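
For readers more used to a modern language, the data of Figure 4 can be mirrored roughly in Python. The
sketch below is an illustration only: the class and method names are hypothetical, the memory-address
fields for the initial and current versions are omitted, and the helper mark_down merely shows the kind of
lookup an object manager might perform when it learns that a copy has gone down.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class CallEntry:
    action_id: int
    latest_chkpnt: int
    primary_copy: str  # Extended_Object_UID of the copy executing the call


@dataclass
class ObjectManagerData:
    configuration: Dict[str, str] = field(default_factory=dict)        # copy -> "up" or "down"
    em_list: List[str] = field(default_factory=list)                   # static election manager order
    current_calls: Dict[str, CallEntry] = field(default_factory=dict)  # keyed by Call_id
    retained_values: Dict[str, Any] = field(default_factory=dict)      # Call_id -> response

    def avail_copies(self) -> List[str]:
        return [c for c, s in self.configuration.items() if s == "up"]

    def mark_down(self, copy: str) -> List[str]:
        """Record that `copy` is down and return the call-ids that lost their primary copy."""
        self.configuration[copy] = "down"
        return [cid for cid, e in self.current_calls.items() if e.primary_copy == copy]


data = ObjectManagerData(
    configuration={"obj@s1": "up", "obj@s2": "up"},
    em_list=["obj@s1", "obj@s2"],
    current_calls={"G42": CallEntry(action_id=1, latest_chkpnt=3, primary_copy="obj@s1")},
)
orphaned = data.mark_down("obj@s1")
```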


Global exception conditions (signals) are communicated by a copy of the replicated object to
another copy (or copies) using the primitive signal(condition) to destination which raises the specified
exception condition at the objects specified by the destination field. It is possible to pass some
parameters to the destination object when signaling an exception condition. We assume here that the signals
are communicated using the reliable broadcast protocol described in [5]. This broadcast protocol has the
following properties: (1) All objects in the destination list receive the signal. (2) The order in which
signals are sent by a sender is preserved at the receivers. (3) Signals sent by different senders are
received by all common receivers in the same order. We assume that the reliable broadcast protocol
persistently tries to deliver signals to only those copies which remain up during the execution of the
broadcast protocol and ignores any copy which goes down. For the correctness of our recovery protocols we
require that the status broadcast period T be selected such that all signals are guaranteed to be delivered
to all the up copies within 2T.
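
Ordering properties (2) and (3) can be made concrete by checking them against delivery logs. The Python
sketch below does not implement the broadcast protocol of [5]; it only tests whether given logs satisfy
the two ordering properties. Representing a signal as a (sender, sequence number) pair and the function
names are assumptions of the sketch.

```python
from typing import Dict, List, Tuple

Signal = Tuple[str, int]  # (sender, per-sender sequence number): a hypothetical encoding


def fifo_per_sender(log: List[Signal]) -> bool:
    """Property (2): signals from one sender are delivered in the order they were sent."""
    last_seen: Dict[str, int] = {}
    for sender, seq in log:
        if seq <= last_seen.get(sender, -1):
            return False
        last_seen[sender] = seq
    return True


def same_relative_order(log_a: List[Signal], log_b: List[Signal]) -> bool:
    """Property (3): two receivers deliver the signals they have in common in the same order."""
    common = set(log_a) & set(log_b)
    return [s for s in log_a if s in common] == [s for s in log_b if s in common]


receiver_1 = [("p", 0), ("q", 0), ("p", 1)]
receiver_2 = [("p", 0), ("q", 0)]
receiver_3 = [("q", 0), ("p", 0)]

consistent = same_relative_order(receiver_1, receiver_2)
inconsistent = same_relative_order(receiver_1, receiver_3)
```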

In this report we do not discuss the copy restart protocol which is executed by the object managers
when a failed copy tries to rejoin the configuration. To ensure the correctness of the election protocol in
our design, we assume that the restart protocol has the property that it delays inclusion of a restarting copy
in the configuration if a recovery (election) is currently on-going for some action. Also, if the permission
to join the configuration is signaled to a restarting copy during the interval [kT, (k+1)T], then at the latest
by time (k+2)T this copy will start sending its status messages to all other copies. This means that with
very high probability all other copies will include this copy in their avail list by time (k+4)T. A new (or
restarted) copy does not act as an intermediary to include other new copies in the configuration during this
4T period.
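
The timing bounds above can be worked through with a small calculation. The following Python sketch is only an illustration of the stated assumption; the function name restart_deadlines and its result fields are hypothetical and do not appear in the protocols of this report.

```python
import math

def restart_deadlines(t_permission: float, T: float) -> dict:
    """Bounds implied by the restart assumption for a status period T.

    t_permission is the time at which permission to join is signaled.
    """
    if T <= 0:
        raise ValueError("T must be positive")
    k = math.floor(t_permission / T)   # permission falls in [kT, (k+1)T]
    return {
        "k": k,
        "status_by": (k + 2) * T,      # copy is sending status messages by then
        "included_by": (k + 4) * T,    # others list it as available (with high probability)
    }
```

For example, with T = 10 and permission signaled at time 25, the copy falls in the interval [20, 30], starts sending status messages by time 40, and is expected to be in every avail list by time 60.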

5.  Notation  for  Protocol  Descriptions 

All protocols are presented in a Pascal-like notation. Most of the constructs have conventional
meaning. The loop...end construct represents iteration of the enclosing sequence of statements; the
iteration is terminated with the execution of an exit statement. We also use an exception handling model with
the statements to provide an interrupt driven execution of the protocols.




We use the termination model of exception handling [12] to describe the protocols. Similar models
have been used in several programming languages such as Ada[8], Gypsy[7], and CLU[12]. In our model
an exception handling block can be associated with any program statement for dealing with anticipated
exception conditions. The exception handling block lists the exception conditions and the actions (called
exception handlers) associated with them. If any of the listed exception conditions arise during the
execution of its associated statement, the corresponding exception handler is executed and the execution of this
handler terminates the execution of that statement. For example, in the program fragment shown below, a
timeout condition will terminate the computation enclosed within the begin...end block and execute the
actions associated with the timeout condition. The set_timer(timeout_period) construct is used to set the
timer with the desired timeout value; this will raise the timeout exception after the expiration of
the specified units of time.

    set_timer(100);
    begin
        S1; S2; ...; Sn;
    end when
        C1: X1;
        C2: X2;
        C3: X3;
        timeout: /* execute the timeout action */
    end;
    S;

In the above example the execution of the sequence of statements S1; S2; ...; Sn will be interrupted and
terminated if any of the exception conditions C1, C2, C3 or timeout is signaled. If more than one condition
is signaled, then the exception handler corresponding to only one of the signaled conditions will be
selected and executed. This selection is non-deterministic. After executing the exception handler, the
execution will proceed with the next following statement, which is S in the above example. If none of the
specified exception conditions arise during the execution of the sequence S1; ...; Sn, then all exception
handlers are ignored and the execution proceeds with the next statement.
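
The termination model can be approximated in a conventional language with ordinary exceptions. The sketch below is in Python and is only an approximation under a simplifying assumption: conditions are raised synchronously by the statements themselves, whereas in the notation above they may be signaled asynchronously. The names Condition and run_block are hypothetical.

```python
class Condition(Exception):
    """A signaled exception condition, identified by name (hypothetical helper)."""
    def __init__(self, name):
        super().__init__(name)
        self.name = name

def run_block(statements, handlers):
    """Run statements in order, like begin S1; ...; Sn end when ... end.

    If a statement raises a Condition listed in handlers, the handler runs
    and the block terminates without resuming (termination model).
    Returns the results of the statements that completed.
    """
    done = []
    try:
        for stmt in statements:
            done.append(stmt())
    except Condition as c:
        if c.name not in handlers:
            raise                  # not handled here: propagate to an outer block
        handlers[c.name]()         # handler runs; the interrupted block is abandoned
    return done
```

A block whose second statement signals timeout completes only its first statement; the handler runs, and control then continues after the block.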

We also use the when...end construct to wait concurrently on some number of exception conditions
or message reception events. The execution of such a statement completes when any of the specified
conditions arises; in such an event the corresponding exception handler is executed. If more than one condition
arises concurrently, then one of them is selected non-deterministically and the corresponding exception
handler (which can be a null statement) is executed. For example:

    when
        C1: X1;    /* C1 and C2 are some exception conditions */
        C2: X2;
        receive(msg) from a: X3;
        timeout: ;
    end;

The execution of the above statement will complete if any of the conditions C1, C2 or timeout is raised, or
a message is received from a, and the corresponding exception handler is executed. In this example, in
case of timeout, no action is taken for exception handling. In case of nested blocks, the exception
conditions defined with the outer blocks always have priority over the ones associated with the inner blocks
nested within them.

Sometimes it is required that all external exception handling (i.e., defined with the outer blocks) be
masked and kept pending during the execution of some sequence of statements. As a notation we enclose
such a sequence of statements within square brackets "[" and "]". This only means that any exception
handling associated with the outer blocks will not be effective during the execution of this sequence. If any of
the statements in this sequence have exception handling associated with them, then those exception
handlers will still be enabled and executed when those conditions arise. (This makes the execution of the
sequence atomic with respect to outer exception conditions.)

Using the unreliable communication primitives described earlier we implement a new primitive
sync_send as described in Figure 5. The sync_send primitive persistently tries to send a message to each of
the destination objects until a response is received from all the objects which appear to be up; it ignores
any destination object which goes down during the broadcast (this sync_send primitive should not be
confused with the reliable broadcast protocol assumed for sending signals). When sync_send is invoked, a
procedure can be passed to it through the formal parameter ack_handler to handle the response messages
from the destination objects.
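
The retry behavior of sync_send can be sketched as a round-based simulation. The Python code below is a minimal sketch, not the procedure of Figure 5: the callbacks send_round and is_down and the bound max_rounds are assumptions introduced for the example (the primitive described above retries without a bound).

```python
def sync_send(msg, dest, send_round, is_down, ack_handler, max_rounds=100):
    """Resend msg until every destination has acknowledged or gone down.

    send_round(msg, dests) models one send plus one timeout period and
    returns the (sender, ack) pairs that arrived in that period.
    is_down(a) reports whether destination a is currently known to be down.
    Returns True if no destination is left pending.
    """
    pending = set(dest)
    for _ in range(max_rounds):
        if not pending:
            return True
        for a, ack in send_round(msg, sorted(pending)):
            if a in pending:
                pending.discard(a)      # All := All - [a]
                ack_handler(ack, a)
        # destinations that went down are dropped rather than retried
        pending = {a for a in pending if not is_down(a)}
    return not pending
```

With a lossy network in which one destination answers only on the second attempt and another is down, the message is resent once to the slow destination and the down destination is dropped after the first round.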

6.  Replication  Management  Protocol 

This section describes the Execution Protocol, shown in Figure 6, executed by each of the replicated
copies. A copy can function as a primary copy (executing the protocol shown in Figure 7) or as a secondary
copy (executing the protocol shown in Figure 8) and this role can change dynamically due to the

Procedure sync_send(msg: message; dest: list of Object_UID; ack_handler)
begin
    All := [dest];
    loop
        if All = ∅ then exit;
        send(msg) to All;
        set_timer(timeout_period);
        loop
            if All = ∅ then exit;
            when
                receive(ack) from some a in All:
                    All := All - [a];
                    ack_handler(ack, a);
                down(a) for some a in All: All := All - [a];
            end /* of when */
        end /* of loop */
        when
            timeout: ;
        end /* of when */
    end; /* of loop */
end; /* of sync_send */


Figure 5: Synchronous Send


failure and recovery as determined by the execution protocol described below. The description of the
protocol is presented here in terms of its four major components: protocols for primary copy, secondary
copy, election manager, and election participant. When an action is invoked, after the execution of
appropriate concurrency control protocols and designation of the primary copy, the action is scheduled for
execution. All the up copies of the object execute the Execution Protocol described in Figure 6. The
invoked action is passed as a parameter to this procedure. A copy can determine whether it is the primary
copy or a secondary copy for the action by executing the function I_AM_Primary(A). It then executes
the appropriate protocol as described below. One must note here that a copy can be concurrently acting
both as the primary copy for some invocations and as the secondary copy for the others. This feature can be
effectively used in incorporating various load balancing protocols in our design. The execution of the
primary or secondary copy protocol can get interrupted if the local signals Primary_Copy_Failure or
Election_Call are raised. Under these conditions, the exception handler for the condition is executed, which
essentially forces each copy to participate in the Election Protocol to elect the new primary copy. It is
possible for the execution of the Election Protocol at a copy to get interrupted if any of the conditions
EM_Failure or Election_Terminate is raised. After such an interruption and the execution of the
corresponding exception handler, the election protocol continues until the new primary copy is known.

6.1.  Primary/Secondary  Copy  Execution 

When an object starts executing an action, the primary copy for that action first performs a
synchronous checkpoint with the secondary copies. This informs all other (secondary) copies that a call was made
with a particular unique number for performing an action. Moreover all the copies should be in the same
state to perform that action, so appropriate checkpointing is also required.

After performing the above steps, the primary copy executes the action in the same fashion as an
unreplicated object, but periodically sends synchronous or asynchronous checkpoints to other copies (see
Primary Copy Execution protocol in Figure 7). Such checkpoint operations are inserted in the code by the
programmer. All checkpoint messages are sent using the unreliable datagram facility. All checkpoints are
numbered in increasing order, starting with the initial checkpoint as 0. A checkpoint message contains the
complete state of the object's local data which has been modified by the primary copy. A secondary copy


Procedure Execution_Protocol(A: Action);
var
    crash_restart: Boolean; /* indicates whether a crash has occurred during the execution of this action */

    Procedure Primary_Copy_Protocol(A: Action);    /* Figure 7 */
    Procedure Secondary_Copy_Protocol(A: Action);  /* Figure 8 */

begin
    crash_restart := false;
    loop
        begin
            if I_AM_Primary(A)  /* This function is true iff this copy is currently the primary copy */
            then
                Primary_Copy_Protocol(A);
                exit;  /* completion of the primary copy execution */
            else
                Secondary_Copy_Protocol(A);
                exit;  /* completion of the secondary copy execution */
        end when
            Primary_Copy_Failure, Election_Call:
                /* detection of primary copy failure or an election has been started */
                primary_copy := unknown;
                Election_Protocol;
                crash_restart := true;
        end; /* of when */
    end;
end; /* end of Execution Protocol */

Figure 6: Execution Protocol

receiving a checkpoint is not required to wait for any previous checkpoint messages except the initial
message. In order to preserve the correctness of checkpoint information kept by secondary copies, the
exception conditions are masked while a secondary copy is storing the received information. In case of
synchronous checkpoints, the primary copy waits for all the other copies to respond before it continues the
execution of the action. However in an asynchronous checkpoint, the primary copy simply broadcasts the
checkpoints and continues the execution of the action. Thus if the primary copy does not go down, the action is
performed almost in the same way as in the case of an unreplicated object.

After  the  end  of  all  the  operations,  a  final  synchronous  checkpoint  is  performed.  In  this  checkpoint 
all  the  final  values  of  the  modified  local  data  and  the  results  to  be  sent  to  the  invoker  are  recorded  by  other 
copies.  In  case  of  the  primary  copy  failure  during  any  of  the  three  phases,  a  new  primary  copy  is  selected 


Procedure Primary_Copy_Protocol(A: Action);
begin
I:   if not crash_restart
     then
         /* Establish the first checkpoint */
         sync_send(<Call_id, Cur_version, Initial>, avail_copies, ack_hdlr);
     else
         /* load the latest checkpoint information */

II:  /* Perform the action */
     loop
         if end_of_action(A) then exit;
         case operation of
             A_checkpoint: /* Establish an asynchronous checkpoint */
                 send(<A_chkpnt, Call_id, Cur_version, chkpnt#>) to avail_copies;
             S_checkpoint: /* Establish a synchronous checkpoint */
                 sync_send(<S_chkpnt, Call_id, Cur_version, chkpnt#>, avail_copies, ack_hdlr);
             Otherwise: [ /* Execute the requested operation with all exceptions masked */ ]
         end;
     end;

III: /* Establish final checkpoint */
     sync_send(<Call_id, Cur_version, Final>, avail_copies, ack_hdlr);
     /* Send the result to the invoker */
     sync_send(<result>, invoker, ack_hdlr)
end; /* of Primary Copy Protocol */

Figure 7: Primary Copy Execution Protocol

Procedure Secondary_Copy_Protocol(A: Action);
begin
    loop
        receive(Chkpt_msg) from primary_copy;
        [ case Chkpt_msg.chkpnt_num of
              Initial: /* add the received information to current call table for the specified action */
                  Call_id := Call_id;
                  Initial_version := Address(cur_version);
                  Primary_Copy := Sender_id;
              Otherwise: /* update the current call table for the given call-id */
                  if chkpnt_num > Latest_chkpnt
                  then
                      Latest_chkpnt := chkpnt_num;
                      current_version := Address(cur_version);
          end; ]
        /* send an acknowledgement back */
        if S_chkpnt then send(ack) to Primary_Copy;
        if chkpnt_num = Final then exit;
    end;
end; /* of Secondary Copy Protocol */


Figure 8: Secondary Copy Execution Protocol

by executing the election protocol (described in Figure 9) and a new primary copy continues the execution
from the latest checkpoint. Thus a replicated object does not stop its execution in case of site failures and
the execution of its actions is guaranteed to complete as long as an active copy exists.
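
Because checkpoints travel over an unreliable datagram facility, a secondary copy may see them late, duplicated, or out of order. The following Python sketch illustrates the bookkeeping described above under simplifying assumptions: checkpoint numbers are plain integers with 0 as the initial checkpoint, the checkpointed state is an opaque value, and the class and method names are hypothetical.

```python
class SecondaryCopy:
    """Keeps only the newest checkpoint received for the current call."""

    def __init__(self):
        self.call_id = None
        self.primary = None
        self.latest_chkpnt = -1
        self.state = None

    def on_checkpoint(self, sender, call_id, chkpnt_num, state):
        """Store a checkpoint; return True if it replaced the stored state."""
        if chkpnt_num == 0:                  # initial checkpoint
            self.call_id, self.primary = call_id, sender
        elif self.call_id != call_id:
            return False                     # must wait for the initial message
        if chkpnt_num > self.latest_chkpnt:
            self.latest_chkpnt, self.state = chkpnt_num, state
            return True
        return False                         # stale or duplicate datagram
```

A checkpoint numbered 2 that arrives after checkpoint 3 is ignored, so a newly elected primary copy would resume from checkpoint 3.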

In case of a recursive action call to the object we ensure that the copy which starts the execution is
the same as the active copy of the parent operation of this call. If the primary copy of the parent operation
had failed, the call will be blocked till the new primary copy reaches the state in which the parent operation
was executed; such a state can be easily identified by looking at the call-id of the recursive call. This keeps
the behavior of the replicated object the same as that of an unreplicated one, in case of recursive calls.

When an object is acting as a server object, it retains the result message sent to the client for an
invocation. This is needed as the client might be replicated and, in case of failure of one of its copies, another
copy may re-execute the invocation operation and call the server to perform the same action. We have
adopted this idea from the ISIS design. Thus whenever a call is made to an object to perform a certain
action, it checks whether the operation was performed earlier. If it was performed earlier, then it simply
sends the retained value for that action. Hence these results must be stored by the server object till it is sure
that the client object can never make the call again. One option is to discard a result once the client object
has completed all the operations for its action and has stored its own result with the other copies. However
this may not be very efficient. A more efficient way would be to store results for a certain duration and discard
them after some long time. This time should be large enough so that the operation cannot be performed
again.
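
The time-based retention scheme can be sketched as a small table keyed by call id. This Python sketch is illustrative only; the class RetainedResults, its retention parameter, and the explicit now argument (used instead of a real clock) are assumptions made for the example.

```python
class RetainedResults:
    """Retains results per call id and discards them after a fixed duration."""

    def __init__(self, retention):
        self.retention = retention
        self._store = {}                 # call_id -> (result, time stored)

    def expire(self, now):
        """Discard results older than the retention period."""
        stale = [c for c, (_, t) in self._store.items() if now - t > self.retention]
        for c in stale:
            del self._store[c]

    def invoke(self, call_id, action, now):
        """Return the retained result for call_id, or perform action once."""
        self.expire(now)
        if call_id in self._store:
            return self._store[call_id][0]   # duplicate call: resend retained value
        result = action()
        self._store[call_id] = (result, now)
        return result
```

A repeated call within the retention period returns the stored result without re-executing the action. A repeated call after the period re-executes it, which is why the period must be long enough that no such call can occur.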

6.2. Recovery Protocols -- Election of a New Primary Copy

If the primary copy of an invocation fails, the other copies are informed of this failure by their object
managers. The execution of the secondary copy protocol may get interrupted if either the
Primary_Copy_Failure or the Election_Call signal is received (see the exception handling part in Figure 6).
A secondary copy whose execution is interrupted by one of these signals executes the Election Protocol
(Figure 9). In case of a secondary copy failure, other copies are informed and they just need to update the list
of available copies. Among all the copies executing the Election Protocol, the available copy with
the highest priority (according to the EM_List) acts as the election manager and executes the Election
Manager Protocol (Figure 10); all other copies execute the Election Participant Protocol (Figure 11). The
function I_AM_EM returns true if and only if the copy executing this function is eligible to act as the
election manager. After the completion of the election protocol a new primary copy is designated to carry on the
execution.
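
The choice of election manager can be sketched as follows. The Python code is a minimal sketch under one assumption not stated in the text: the EM_List is taken to be ordered from highest to lowest priority. The function names are hypothetical renderings of I_AM_EM.

```python
def election_manager(avail_copies, em_list):
    """Return the highest-priority copy of em_list that is available.

    em_list is assumed to be ordered from highest to lowest priority.
    Returns None if no listed copy is available.
    """
    for copy in em_list:
        if copy in avail_copies:
            return copy
    return None

def i_am_em(my_id, avail_copies, em_list):
    """True iff this copy is the one eligible to act as election manager."""
    return election_manager(avail_copies, em_list) == my_id
```

If the highest-priority copy is down, the next available copy in the list becomes eligible, and every copy evaluating the function on the same avail list reaches the same answer.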

The Election Protocol execution may get interrupted if an EM_Failure or Election_Termination
signal is received. In case of the first signal, a copy restarts its Election Protocol and once again checks if now
it should be the new election manager, and then accordingly executes either the Election Manager Protocol
or the Election Participant Protocol. In case of the Election_Termination signal, the id of the new primary
copy is sent to all copies through the signal, the execution of the Election Protocol is terminated, and the
Execution Protocol continues executing either the Primary Copy Protocol or the Secondary Copy Protocol.

The election protocol is structured such that if due to some error conditions two copies start
executing the Election Manager Protocol, then only one of them will be able to complete the election
successfully. The three-phase structure of the election protocol is very similar to that of reliable broadcast
protocols described in [5]. A copy acting as the election manager first sends a signal called Election_Call to all

Procedure Election_Protocol;
var
    new_primary: Object_UID;
    successful: Boolean;
    EM: Object_UID; /* Current Election Manager copy */

    Procedure Election_Manager_Protocol;      /* Figure 10 */
    Procedure Election_Participant_Protocol;  /* Figure 11 */

begin
    loop
        begin
            if I_AM_EM
            then
                Election_Manager_Protocol;
                if successful
                then exit /* election successfully completed */
            else
                Election_Participant_Protocol;
                exit;
        end when
            EM_Failure: ;
            Election_Termination(Primary_Copy): exit;
        end; /* of when */
    end;
end;

Figure 9: Protocol for Electing New Primary Copy

other available copies to make sure that all those copies are also ready to participate in the election. This
signal is sent persistently until an acknowledgement is received from all the other copies. An election
participant sends an ack_1 only if it has not received an Election_Call signal from some other higher priority
copy acting as an election manager. Otherwise it sends back a nack. When a participant sends an ack_1, it
sets its EM variable to the id of the copy which sent this Election_Call signal. If in the first phase any
nack is received by the Election Manager procedure, its execution terminates, setting the global variable
successful to false.


In the second phase, the copy executing as an election manager and receiving all ack_1 messages in
the first phase requests the latest checkpoint numbers from all the participant copies. Meanwhile if some
participant has received an Election_Call from some other higher priority copy, its EM would be set to the
id of that copy. Therefore, in response to the latest checkpoint request, a participant copy sends the latest
checkpoint only if the requester's id is equal to EM; otherwise it sends a nack. The Election Manager
Protocol terminates with successful set to false if any nack is received during this phase.

Procedure Election_Manager_Protocol;
var
    no_nack_received: Boolean;
    participants: list of Extended_Object_UID;
begin
    begin
        participants := avail_copies;
        successful := true;
        signal(Election_Call) to participants;
        set_timer(timeout_period);
        remaining_copies := participants;
        StartTime := clock;
        loop /* Phase I */
            when
                receive(ack) from a, down(a):
                    remaining_copies := remaining_copies - [a];
                    if remaining_copies = ∅ then exit;
                timeout:
                    if remaining_copies ≠ ∅
                    then
                        signal(Election_Call) to remaining_copies;
                        set_timer(timeout_period);
                    else exit
            end; /* of when */
        end; /* of loop */

        participants := participants ∩ avail_copies; /* Phase II */
        if no_nack_received
        then
            sync_send('latest_chkpnt?', participants, elect_primary);
            if no_nack_received
            then [ /* mask exception handling from outside blocks */
                sync_send('Election_Success', participants, all_acks); /* Phase III */
                if no_nack_received
                then
                    when
                        (clock - StartTime) > 4T: ;
                    end;
                    signal(Election_Terminate) to avail_copies; /* signal to all up copies */
                else
                    successful := false;
                    signal(Election_Abort) to avail_copies; ] /* resume exception handling */
            else successful := false;
        else successful := false;
    end when
        Election_Call from a where priority(a) > priority(mycopy):
            EM := a; send(ack_1) to a; successful := false;
    end;
end; /* of Election_Manager_Protocol */

Figure 10: Election Manager Protocol

The Election Manager Protocol enters the third phase if no nack is received during the second phase.
In this phase, the election participant copy with the largest checkpoint is elected as the primary copy. The
election manager then sends a message 'Election_Success' to all the participants. A participant which still
has its EM set equal to the id of the sender of this message sends an ack_3 message, otherwise it sends a
nack. Up to the point of sending this ack_3 message, a participant is free to send an ack_1 or the latest
checkpoint number to any other higher priority copy which might start executing the election manager
protocol. However, once the ack_3 in the third phase is sent, a participant has to wait for either the
Election_Abort or the Election_Termination signal and during this period it may not send ack_1 messages
to any other election manager. Therefore no Election_Call signal is accepted in the third phase.
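
The participant's behavior across the three phases can be summarized as a small state machine. The Python sketch below follows the prose description rather than reproducing Figure 11. The priority table passed to the constructor, the committed flag, and the convention of returning None for a withheld response are all assumptions made for the example.

```python
class Participant:
    """Election participant responses, following the three-phase description."""

    def __init__(self, priority, latest_chkpnt):
        self.priority = priority            # copy id -> priority (assumed table)
        self.latest_chkpnt = latest_chkpnt
        self.em = None                      # election manager currently accepted
        self.committed = False              # True once ack_3 has been sent

    def on_election_call(self, a):
        if self.committed:
            return None                     # kept pending during phase III
        if self.em is None or self.priority[a] >= self.priority[self.em]:
            self.em = a
            return "ack_1"
        return "nack"                       # a higher priority manager was accepted

    def on_latest_chkpnt_request(self, a):
        if self.committed:
            return None
        return self.latest_chkpnt if a == self.em else "nack"

    def on_election_success(self, a):
        if self.committed:
            return None
        if a == self.em:
            self.committed = True           # now waits for terminate or abort
            return "ack_3"
        return "nack"

    def on_terminate_or_abort(self):
        self.committed = False
```

A participant that has acknowledged a higher priority manager answers nack to a lower priority one in every phase, so at most one manager can collect a full set of acknowledgements.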

Acceptance of an Election_Termination signal clears all pending Election_Call signals. Similarly
the acceptance of an Election_Call signal clears all pending Election_Termination or Election_Abort
signals. If the election completes successfully, the Election Manager Protocol sets successful to true and that


Procedure Election_Participant_Protocol;
begin
    EM := highest priority in (avail_copies ∩ EM_List);
    loop
        when
            Election_Call from copy a:
                if a has the highest priority in (avail_copies ∩ EM_List)
                then EM := a; send(ack_1) to EM;
                else send(nack) to a;
            receive('Latest_chkpnt?') from a:
                if EM = a
                then send(Latest_chkpnt) to EM;
                else send(nack) to a;
            receive('Election_Success') from a:
                if EM = a
                then
                    send(ack_3) to EM;
                    when
                        Election_Terminate(Primary_Copy): exit; /* the loop and terminate election */
                        Election_Abort from EM: ;
                    end;
                else send(nack) to a;
        end /* of when */
    end; /* of loop */
end; /* of Election Participant Protocol */

Figure 11: Election Participant Protocol

causes the termination of the Election Protocol at the election manager copy. Whenever the Election
Manager Protocol terminates with successful equal to false, the Election Protocol is retried.

One should note here that a successful election is forced to last at least 4T units of time. Also, the
election termination signal is sent to all available copies and not just to those which participated in the
election. This makes sure that any copy which was given permission to join the configuration before the start of
the election will also receive this signal. Because of the 4T duration, all such new copies will be in the
available list when the termination signal is sent.

It is possible that an execution of the Election Manager Protocol may get interrupted if it receives
an Election_Call signal from some other higher priority copy. In this case the Election Manager Protocol
terminates by sending an ack_1 message to the sender of the signal, and setting successful to false and EM
to the id of the sender of the signal. (See the exception handling part of Figure 10.) However, during the
third phase all exception conditions associated with the outer blocks are masked. This means that once an
election manager enters the third phase, an Election_Call signal from some higher priority process will be
kept pending during this phase. If the election completes successfully then all copies, including the waiting
election manager, will receive the Election_Termination signal and all pending Election_Call signals
will be cleared.

7.  Correctness  of  the  Election  Protocol 

In order to show that this recovery protocol functions correctly we need to show that (1) the protocol
is free from deadlocks and livelocks, (2) once an election is started, eventually only one
Election_Termination signal will be generated, and (3) the same election signal will terminate the election
protocol at all participant copies.

The absence of deadlocks in the election protocol follows from the following observations. A copy can
withhold a response message from another copy, and thereby cause it to wait, only in the following situations.
(1) A participant which has sent an ack_3 message to an election manager waits for either an
Election_Termination or Election_Abort signal and does not send any response to other election managers
which have issued either an Election_Call or the latest checkpoint request. (2) An election manager
which is in phase III withholds its response to any Election_Call signal from some other election manager. (3)
An election manager in phase III withholds the Election_Termination or Election_Abort signal
until it has received the response messages from all participants in the protocol. Note that a participant
never withholds response messages from an election manager which is in the third phase. To show absence of
deadlocks we show that only a higher ordinality process can withhold responses from a lower ordinality
process. Assign a time-dependent ordinality to a process as a pair of integers <p,q>, where p is equal to
the current phase number for an election manager process, and for a participant it is the highest phase
number of the messages it has received from and responded to any election manager; q is equal to 0 for a
participant and 1 for an election manager. The value of this number is interpreted as the lexicographic ordering of
p and q. Thus an election manager in phase III has ordinality 31 and a participant which has sent it an ack_3
message has ordinality 30.
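
The ordinality comparison can be checked mechanically, since tuples in Python compare lexicographically. This is a small illustrative sketch; the function names ordinality and may_withhold_from are hypothetical.

```python
def ordinality(phase, is_manager):
    """Ordinality <p,q>: p is the phase number, q is 1 for a manager, 0 for a participant."""
    return (phase, 1 if is_manager else 0)

def may_withhold_from(holder, waiter):
    """Only a strictly higher ordinality process may make a lower one wait."""
    return holder > waiter      # tuple comparison is lexicographic
```

Because every waiting relation points from a lower ordinality to a strictly higher one, a chain of waiting processes cannot close into a cycle.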

The proof of absence of livelocks is based on the observation that the sequence of values assigned to
the variable EM of any process has monotonically increasing values (in terms of priorities assigned to the
election managers in the EM_List), unless an election manager goes down. Thus if two or more election
managers repeatedly execute their protocols, then eventually all participants will have their EM set to the
id of the highest priority election manager which does not fail. Eventually only that election manager will
complete the election successfully and generate the Election_Termination signal.

Now we want to show that all participants will terminate their election because of this signal. The
only other possibility for a participant to terminate the election protocol with its current election manager
(which is in its third phase) is when it receives the EM_Failure signal. If the election manager fails
immediately after signaling Election_Termination, then all participants will receive this signal within the
next T period; moreover, the EM_Failure condition will be detected and signaled only 2T units of time
after the failure event. Thus all participants will terminate the election successfully due to the same election
termination signal.


Appendix:  Status  Monitoring  Protocol 

This protocol is used for monitoring the up/down status of some set of processes in the network. The
protocol assumes that the clocks in the network are synchronized using some network clock
synchronization protocol. The protocol executes in synchronous steps; at every T interval, each process broadcasts its
view of the status information to all the other processes in the network. (Assume that T >> the maximum
message delay in the network.) This status information consists of the status of other processes from the
viewpoint of this sender. For each process, the status is maintained in terms of three colors: white, grey, and
black. White and grey colors are interpreted as up status for that process, and black implies down. During
phase k, a process u computes the status of some other process v based on the status messages it
received from other processes in phase (k-1). After computing this new status, it broadcasts it to all other
processes.


Phase 1:

    /* mark status of every copy in the configuration as grey */
    send(status_info) to others

Phase k > 1:

    for all v in configuration
    do
        if a status message was received from v in phase k-1
            then mark v as white
        else if any status message (received in phase k-1) had v marked as white
            then mark v as grey
        else mark v as black
    end

    send(status_info) to others
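
The colouring rule above can be sketched as a small simulation. This is only an illustrative sketch, not the report's implementation: the names `next_status`, `simulate`, and `last_phase` are hypothetical, message delivery is assumed to be perfect within a phase, and a process is assumed not to track its own status.

```python
WHITE, GREY, BLACK = "white", "grey", "black"


def next_status(config, me, received):
    """Phase k > 1 rule: recompute the colour of every other process.

    `received` maps each sender heard from in phase k-1 to the status
    table that the sender broadcast in that phase.
    """
    status = {}
    for v in config:
        if v == me:
            continue  # assumption: a process does not track its own status
        if v in received:
            status[v] = WHITE  # heard from v directly
        elif any(table.get(v) == WHITE for table in received.values()):
            status[v] = GREY   # somebody else heard from v recently
        else:
            status[v] = BLACK  # nobody reports v as white: v is down
    return status


def simulate(config, last_phase, phases):
    """Run the protocol; last_phase[p] is the final phase in which p is up."""
    def alive(p, k):
        return last_phase.get(p, phases) >= k

    # Phase 1: every process marks all others grey and broadcasts.
    tables = {p: {v: GREY for v in config if v != p}
              for p in config if alive(p, 1)}
    views = [tables]
    for k in range(2, phases + 1):
        sent = tables  # messages broadcast in phase k-1
        tables = {
            p: next_status(config, p,
                           {q: m for q, m in sent.items() if q != p})
            for p in config if alive(p, k)
        }
        views.append(tables)
    return views
```

In this sketch a process that stops after phase 1 is seen by the others as white in phase 2, grey in phase 3, and black from phase 4 onward.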
Abstract


Reaching agreement in a distributed system with malfunctioning components is an important
issue in building reliable computer systems. Byzantine agreement in a distributed environment has
been an important problem in this context. Randomized algorithms for reaching Byzantine Agreement
were proposed by Rabin and further studied by many other researchers. These algorithms use shared
secrets and cryptography to reach agreement in a constant expected number of phases. These shared secrets
have to be boot-loaded during system initiation. This may become a serious drawback in practical
situations. In this paper we present an algorithm which reaches agreement within a constant expected number of
phases, independent of the number of processes in the system, without using shared secrets and digital
signatures. The only shared information among processes is the knowledge of a logical configuration of all
processes in a virtual ring. This information is boot-loaded with every process during system initiation.
This algorithm can overcome $\lfloor (n-1)/3 \rfloor$ faults, where n is the total number of processes. The algorithm
presented here achieves Byzantine agreement on binary values. However, this algorithm can be easily
converted into multi-valued Byzantine agreement using the techniques described by some other researchers.


This work was supported by a grant from Rome Air Development Center under the Post-Doctoral Fellowship Program.

1.0 INTRODUCTION


We consider a set of n communicating processes $P = \{P_1, P_2, \ldots, P_n\}$ with initial values from the set
$\{M_1, M_2, \ldots, M_k\}$. The processes execute an agreement algorithm to agree on some common value m from
this set. The processes are assumed to be connected by a complete graph with n vertices, i.e., there
is a communication link between any two processes. The system contains proper processes, which follow
the algorithm faithfully, and faulty processes, which may deviate, sometimes maliciously, to prevent an
agreement. We say that the system is proper if all proper processes have the same value for variable M.
Byzantine agreement is defined as follows:

(1) If the system is proper in its initial state with m as the initial value of the proper processes, then the
algorithm terminates with the system in the proper state with value m.

(2) If the system is not proper in the initial state, then the algorithm terminates with the system in a proper state
with any value.


The Byzantine agreement problem has been studied extensively for the past few years. It has
been proved that no deterministic algorithm can reach Byzantine agreement in fewer than (t+1) phases,
where t is the number of faulty processes in the system. This led to randomized algorithms for
reaching Byzantine agreement, which reach agreement in a small constant expected number of phases.


Rabin's algorithm can tolerate up to [...] faulty processes in the synchronous case, but it assumes that
some messages may be required to be authenticated by digital signatures. Moreover, it assumes the presence of
a stream of secretly shared bits distributed by a non-faulty dealer at system initiation time, large enough
to last during the system's operational life. Other researchers have improved Rabin's algorithm in terms
of the number of faulty processes that can be tolerated in order to achieve agreement. However, all these
algorithms still require shared secrets and authentication. This turns out to be a serious drawback of the
above schemes. Other algorithms, which do not require authentication and shared secrets and reach Byzantine
agreement within a constant amount of expected time, can survive only [...] faulty processes.


In this paper we propose an algorithm that can tolerate up to $\lfloor (n-1)/3 \rfloor$ faulty processes in the
synchronous case and does not use any shared sequence of bits or digital signatures for authenticating messages.

We organize the remainder of the paper in the following manner. In section 2, we briefly give our
model assumptions. In section 3, we describe our algorithm. In section 4, we prove that the expected
number of phases it will take to reach an agreement is a constant, independent of the number of faulty
processes.

2.0  MODEL  ASSUMPTIONS 

Let $P = \{P_1, P_2, \ldots, P_n\}$ be the processes participating in the protocol. Assume that they have binary values
$\{M_1, M_2, \ldots, M_n\}$ before the execution of the protocol. These processes are interconnected by a totally
connected network. We assume synchronous communication among processes, which implies that there exists
an upper bound on the message delays between any pair of proper processes. We define a phase to be the
interval of time in which each proper process is able to communicate with all other proper processes. A
round consists of 2 phases of information exchange. One of the processes is designated to be the dealer for
a particular round. This dealer is decided in a round robin fashion for each consecutive round, i.e., if P_1 is
the dealer for the first round, then P_2 is the dealer in the second round and so on. For this purpose all
processes form a virtual ring, and the knowledge of this virtual ring configuration is the only shared
information which is required to be boot-loaded with every process during system initialization. Thus each
process has the information about who is the dealer for the current round and who will be the dealer in
later rounds. The starting dealer for every agreement run is also chosen in a round robin fashion. Faulty
processes may deviate from the algorithm and may maliciously, and possibly in collusion with each other,
try to jeopardize the agreement.
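
The round robin choice of dealer can be illustrated with a one-line rule. This is a sketch under stated assumptions: processes are numbered 0 to n-1 around the virtual ring, and agreement run j is assumed to start with process j mod n as its first dealer. The function name `dealer_for_round` is hypothetical.

```python
def dealer_for_round(run_index, round_index, n):
    """Return the 0-based ring position of the dealer.

    Assumption: agreement run `run_index` starts with process
    (run_index mod n), and the dealer role then moves one step
    around the virtual ring in every round.
    """
    return (run_index + round_index) % n
```

Because every process knows the ring, each one can evaluate this rule locally; in any n consecutive rounds every process acts as dealer exactly once.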

Since the processes are interconnected by a totally connected network, each processor can recognize
the sender of each message and faulty processes cannot change the message of a non-faulty processor.
Each message is of the form (r,p,m) where r is the round number, p is the phase and m is a binary
message.

3.0  THE  ALGORITHM 

A process P_p starts the algorithm by setting the value of variable x to its initial value M_p. This is
followed by R rounds of message exchange, each having 2 phases. The number of rounds R is selected on the
basis of the desired level of confidence in the final agreement value as a correct agreement.

Process P_p

    x := M_p
    for Round := 1 to R do
    begin
        Phase 1: Send (Round, 1, x) to all processes including itself
                 If (Round, 1, m) received from more than (n+t)/2 processes then
                 begin
                     Decided := true; x := m
                 end
                 else Decided := false;

        Phase 2: If p = dealer_for_this_round then
                 begin
                     If Decided then
                         send (Round, 2, x) to all other processes
                     else
                     begin
                         x := random(0,1);  {Function random returns either
                                             0 or 1 with probability 1/2}
                         send (Round, 2, x) to all other processes
                     end
                 end
                 else
                 begin
                     If received (Round, 2, m) from the dealer_for_this_round then
                         If not Decided then x := m
                 end
    end;

Algorithm: Randomized Byzantine Agreement

In phase 1, each process sends its value x to all the processes and waits to receive messages from every
other process. If a message is not received within a particular amount of time, the maximum time of the phase,
it is assumed to be 0. A proper process is supposed to send a response within the maximum time of the
phase. A process decides on a particular value m if more than $(n+t)/2$ of the messages are of the same type,
and changes its value of x to m. Otherwise its state is undecided.


In phase 2, if the process is the dealer and is decided, it sends its value of x to everyone else. If it is
the dealer and undecided, it randomly chooses between 0 and 1, each with probability 1/2, sends this value to
everyone, and updates its own value. If the process is not the dealer and is undecided for that round, it changes its
value to the value received from the dealer. Otherwise it does not do anything.


The total number of messages sent by all the processes to each other in one round is $n^2$.
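
The two phases of a round can be sketched as a small simulation. This is an illustrative sketch only, under several assumptions: the decision threshold is taken to be "more than (n+t)/2 matching phase-1 messages", faulty processes are modelled as sending independent random bits (a real Byzantine adversary may behave worse), processes are numbered from 0, and the dealer of round r is process r mod n. The name `run_agreement` is hypothetical.

```python
import random


def run_agreement(n, t, initial, faulty, rounds, seed=0):
    """Simulate the two-phase rounds; return the final x of each proper process."""
    rng = random.Random(seed)
    x = list(initial)
    proper = [p for p in range(n) if p not in faulty]
    for rnd in range(rounds):
        dealer = rnd % n  # round robin dealer on the virtual ring
        # Phase 1: everyone sends x; faulty senders send arbitrary bits.
        decided, new_x = {}, {}
        for p in proper:
            counts = [0, 0]
            for q in range(n):
                bit = rng.randint(0, 1) if q in faulty else x[q]
                counts[bit] += 1
            decided[p], new_x[p] = False, x[p]
            for m in (0, 1):
                if 2 * counts[m] > n + t:  # more than (n+t)/2 copies of m
                    decided[p], new_x[p] = True, m
        for p in proper:
            x[p] = new_x[p]
        # Phase 2: the dealer broadcasts; undecided processes adopt its value.
        if dealer in faulty:
            from_dealer = {p: rng.randint(0, 1) for p in proper}
        else:
            if not decided[dealer]:
                x[dealer] = rng.randint(0, 1)  # fair coin toss
            from_dealer = {p: x[dealer] for p in proper}
        for p in proper:
            if p != dealer and not decided[p]:
                x[p] = from_dealer[p]
    return [x[p] for p in proper]
```

When all proper processes start with the same value, every one of them sees at least n-t matching messages in phase 1, decides, and ignores the dealer, so the common value survives every round regardless of what the faulty processes send.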


Theorem 1: A synchronous system with n > 3t and binary initial values executing the above algorithm
will satisfy the following:

1. At the end of the first phase of every round, two proper processes cannot decide on 2 different values.

2. If the system is proper with initial value m, then every proper process terminates the algorithm with
x = m.

3. If the system is not proper, then with probability at least

$$1 - \left[\frac{1}{2}\right]^{k(n-t)} \left[\frac{1 + t/n}{2}\right]^{l}$$

every proper process terminates the algorithm after R rounds with the same value of x, where k = R div n and l = R mod n.


Proof

(1): Suppose P_1 and P_2 decide on X_1 and X_2 respectively, with X_1 ≠ X_2, at the end of phase 1 of any round. Then P_1 must
have received message X_1 from more than $(n-t)/2$ proper processes. Similarly, P_2 must have received
message X_2 from more than $(n-t)/2$ proper processes. These two sets of proper processes must overlap, so at least
one proper process must have sent message X_1 to P_1 and X_2 to P_2. Contradiction.

(2): Suppose all the proper processes start with the same value of x at the beginning of any round. Let
the value be m. We will show that all the proper processes will have the same value of x, equal to m, at the
end of the round. Consequently, once an agreement is reached by all proper processes, this agreement will
hold after each subsequent round. Hence if the system properly commences with value m, all proper
processes will agree on the same value after every subsequent round. The following is the proof.

After the end of the first phase all the proper processes will have received at least (n-t) messages of the same value,
and (n-t) > $(n+t)/2$ since n > 3t. Therefore all proper processes will decide on the same value. In the second phase the
decided processes do not change their values, independent of the dealer's value.

(3): Let us consider the scenario after completion of the first phase of any round before which the system
was not in a proper state. The proper processes can be classified into two sets D and N, referring to decided
and undecided processes respectively. By the end of the first phase all the decided processes must have
agreed on one particular value, say m. There are 3 cases:

1. Dealer is proper and undecided

2. Dealer is proper and decided

3. Dealer is faulty

In Case 1, the dealer will generate a random value in {0,1}, each with probability $1/2$. This is the value
which will be assumed by all undecided processes at the end of this round. This value matches the
value of the decided processes with probability $1/2$, in which case all proper processes will have the same value and
there will be an agreement at the end of this round.


In Case 2, the dealer will send its decided value to everyone and this value will be assumed by all
undecided proper processes. Thus at the end of this round there will be an agreement.

Thus we conclude that there is a probability of at most $1/2$ of not reaching an agreement at the end of
this round if the dealer is not faulty. Let us consider a sequence of y dealers (y < n):

$$D_1, D_2, \ldots, D_y$$

The probability of x dealers being faulty out of these y dealers is [...].

If there are x faulty processes out of these y processes, the probability of not reaching an agreement after y
rounds is at most $[1/2]^{y-x}$. Therefore, the probability of not reaching agreement in y rounds is at most f(y), where

$$f(y) \le \left[\frac{1}{2} + \frac{t}{2n}\right]^{y}$$


Thus our claim is true for y < n. Let us consider a sequence of n dealers

$$D_1, D_2, \ldots, D_n$$

The sequence will contain at least n-t proper processes, as the dealers are chosen in a round robin fashion.
Thus the probability of not reaching an agreement at the end of n rounds will be at most $[1/2]^{n-t}$ times the
probability of disagreement at the beginning of the sequence. Thus for y = kn + l it can be easily shown that the
probability of not reaching an agreement is at most

$$\left[\frac{1}{2}\right]^{k(n-t)} \left[\frac{1 + t/n}{2}\right]^{l}$$

Thus the probability of reaching an agreement is at least

$$1 - \left[\frac{1}{2}\right]^{k(n-t)} \left[\frac{1 + t/n}{2}\right]^{l}$$

□
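
Since the number of rounds R is chosen from a desired confidence level, the bound can be turned into a small calculator. This is a sketch that assumes the failure bound has the form $[1/2]^{k(n-t)} [(1 + t/n)/2]^{l}$ with k = R div n and l = R mod n; the function names `failure_bound` and `rounds_for_confidence` are hypothetical.

```python
from fractions import Fraction


def failure_bound(n, t, rounds):
    """Assumed upper bound on the probability of no agreement after `rounds`."""
    k, l = divmod(rounds, n)
    # (1/2)^(k(n-t)) * ((1 + t/n)/2)^l, with (1 + t/n)/2 = (n+t)/(2n)
    return Fraction(1, 2) ** (k * (n - t)) * Fraction(n + t, 2 * n) ** l


def rounds_for_confidence(n, t, eps):
    """Smallest R whose assumed failure bound does not exceed eps."""
    r = 1
    while failure_bound(n, t, r) > eps:
        r += 1
    return r
```

For n = 4 and t = 1, one full pass around the ring (4 rounds) contains at least 3 proper dealers, which gives a bound of 1/8 under this assumption.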


4.0  ANALYSIS 


Let the expected number of rounds for reaching agreement be E. Let

$$f(y) = \left[\frac{1}{2}\right]^{k(n-t)} \left[\frac{1 + t/n}{2}\right]^{l}$$

where k = y div n, l = y mod n.

Let P(y) be the maximum probability of achieving agreement in y rounds.

$$P(y) = (1 - f(y)) - (1 - f(y-1)) = f(y-1) - f(y)$$

$$E \le \sum_{y} y P(y) = \sum_{y=0}^{\infty} f(y) = 1 + Z + \frac{Z}{2^{(n-t)}} + \frac{Z}{2^{2(n-t)}} + \cdots$$

where $Z = \sum_{y=1}^{n} f(y)$. Hence

$$E \le 1 + \frac{Z}{1 - 2^{-(n-t)}} \le 1 + \frac{8}{7} Z \quad \text{as } n > 3t \text{ and } t > 0$$

[...] for y < n

$$Z < p + p^2 + \cdots + p^n, \quad \text{where } p = [\ldots]$$

Since n > 3t, p < [...]. Hence p < 0.975.

Now it can be easily shown that Z < 39 and E < 46. For n > 5t the values are 3 and 4 respectively. Note
that these are not the tightest bounds on E. With some more calculations, the value of E can be bounded to
39 and 3 respectively.
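
The series for E can also be evaluated numerically. The sketch below assumes the per-round failure bound $f(y) = [1/2]^{k(n-t)} [(1 + t/n)/2]^{l}$ with k = y div n and l = y mod n, and compares a truncated sum of f(y) with the geometric closed form obtained by grouping the terms into blocks of n rounds. The function names are hypothetical.

```python
def f(y, n, t):
    """Assumed bound on the probability of no agreement after y rounds."""
    k, l = divmod(y, n)
    return 0.5 ** (k * (n - t)) * ((n + t) / (2 * n)) ** l


def expected_rounds_series(n, t, terms=400):
    """Truncated sum of f(y) for y = 0, 1, ..., terms - 1."""
    return sum(f(y, n, t) for y in range(terms))


def expected_rounds_closed_form(n, t):
    """Sum of the same series: one block of n rounds times a geometric factor."""
    p = (n + t) / (2 * n)
    one_block = (1 - p ** n) / (1 - p)        # f(0) + f(1) + ... + f(n-1)
    return one_block / (1 - 0.5 ** (n - t))   # each later block shrinks by 2^-(n-t)
```

Under this assumption the sum is small even in the tightest case n = 4, t = 1, which is consistent with the remark that the stated bounds on E are not the tightest possible.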


The following theorem states that for the case n > 5t it is possible for a process to detect agreement
when the system has indeed entered a proper state.


Theorem 2 a) For n > 5t, if in round k some process, P_i, finds (n-t) messages with the same value, then by
round k+1 all the processes in the system will be able to detect that agreement has been reached.
b) If the system is in agreement, then every process will find (n-t) messages with the agreement value.

(The proof is omitted.)


5.0  CONCLUSION 


A new algorithm for achieving fast Randomized Byzantine Agreement has been presented in this
paper. This algorithm does not require any secret sharing of information or digital signatures for
authentication. This algorithm can overcome up to $\lfloor (n-1)/3 \rfloor$ malicious processors and reaches agreement in a constant
expected number of rounds. The number of messages required for each round is $O(n^2)$. The algorithm
presented here achieves Byzantine agreement for binary values; this algorithm can be converted to a
multi-value Byzantine agreement using the techniques described by some other researchers.


This algorithm, without having the practical drawbacks of earlier algorithms, takes a constant
expected number of rounds to achieve agreement. In comparison to algorithms which do not use shared
secrets and cryptography, it has a much better bound on the expected number of rounds required to achieve
agreement.

1. Introduction

A self-stabilizing system has the property of recovering from erroneous (illegitimate) states by continuing
to execute its actions as in normal states. No additional mechanisms, other than those for normal
operations, are required to recover from errors and enter a legitimate state. Such systems were first introduced
by Dijkstra [1] and were called self-stabilizing systems. In this paper, we present a protocol for a
self-stabilizing system consisting of a network of processes in the form of a binary tree and prove its
correctness.

In his paper [1], Dijkstra gives three protocols that achieve self-stabilization, two of which are for
processes connected in a ring and one for processes connected in a chain. In his design, Dijkstra defines
legitimate states as those states in which exactly one privilege is present in the system. Our design is based
on Dijkstra's model of self-stabilizing systems. The system consists of a network of processes, each of
which can obtain the states of some but not necessarily all of the processes in the system. A process
obtains the state of another process by reading the contents of the local memory of that process. Each
process decides whether it has the privilege to make a move (i.e., a local state transition); a privilege is defined
as a Boolean function of its local state and the states of some other processes in the network.

An action by a process consists of checking whether it has one or more privileges present, and if so, then
selecting one of the privileges and executing the corresponding move. To ensure that the privilege
corresponding to a move holds when that move is executed, each such action is assumed to be executed
atomically. Alternatively, this consistency can be ensured by a "daemon" [1] which selects privileges to be
executed sequentially.

This work was supported by a grant from Rome Air Development Center under the Post-Doctoral Fellowship Program (Grant
No. F-30602-81-C-0205).

In a self-stabilizing system, with these local moves made by the individual processes, the whole system
would eventually reach some globally consistent state (legitimate state) regardless of the initial system
state. In addition, if by accident the system is in some illegitimate state, the local moves made by the
processes should bring the system back to some legitimate state after a finite number of moves.

2.  The  protocol 

The system of processes is connected in a binary tree-like topology, and our protocol is adapted from
Dijkstra's second protocol [1]. For the sake of clarity in presenting the protocol, a process in the binary
tree is assumed to have either no children or exactly two sons. The protocol can be easily extended to
eliminate this restriction. We assume that each process knows if it is the left or right son of its father. A left son
does not consult its right sibling in making its move while a right son needs the state information of its left
sibling to decide if it has the privilege. Only adjacent processes can examine each other's local memory in
order to obtain the state information of a neighbor. When a process is examining the state of its neighbor,
the neighbor is not allowed to make a move.

Each process has two boolean state variables: UP and S. For any process, the state variables of its father,
left son, right son, and left sibling (if those nodes exist) are denoted by UP_F, S_F; UP_L, S_L; UP_R, S_R; and UP_LS,
S_LS respectively. By definition, the root has its UP = false and a leaf has its UP = true. To simplify the
protocol, a process that is a left child will have its UP_LS = true and S_LS = S_F although there are actually no
siblings to its left.

For the root, its move is given by

    if UP_L and UP_R and (S = S_L) and (S = S_R)
    then S := ¬S;

For an internal node other than the root, its move is given by

    if ¬UP and UP_L and UP_R and (S = S_L) and (S = S_R)
    then UP := true;

    if UP and ¬UP_F and UP_LS and (S_F = S_LS) and (S ≠ S_F)
    then begin UP := false; S := S_F; end;

For a leaf, its move is given by

    if ¬UP_F and UP_LS and (S_F = S_LS) and (S ≠ S_F)
    then S := S_F;

Observe that whenever a process u has its UP = false, there always exists at least one privilege in the
subtree rooted at u. (This can be easily proved by induction on the size of the subtree rooted at u.) Now,
UP_root = false by definition. Hence, there is always at least one privilege in the system. In addition, if at
some point all nodes except the root have their UP = true and S = S_root at the same instant, the system is
stabilized and will remain stabilized if the protocol is observed. In this case, when the root makes a move,
the privilege is passed to its left and then to its right subtree before the root regains the privilege. When a
node u ≠ root obtains the privilege from its father, u makes a move, passes the privilege to its left and then
to its right subtree before it regains the privilege from its sons and passes it up again.

Note that in this protocol we do not need any additional mechanisms such as a "daemon" or locking
protocols to ensure the consistency of the actions. This can be seen from the fact that when a process has a
privilege present based on the states of some other processes, then those processes cannot be possessing
any privileges.
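
The moves of the protocol can be exercised in a small simulation. This is an illustrative sketch under stated assumptions: the tree is a complete binary tree stored in heap order (node i has sons 2i+1 and 2i+2), the privilege conditions are read as "both sons up with equal S" for the upward move and "father down, left sibling up, S_F = S_LS, S ≠ S_F" for the downward move, and a random scheduler picks one privileged node at a time. The function names are hypothetical.

```python
import random


def privileged(i, up, s):
    """True if node i of the heap-ordered complete binary tree may move."""
    n = len(up)
    is_leaf = 2 * i + 1 >= n
    kids_ready = False
    if not is_leaf:
        l, r = 2 * i + 1, 2 * i + 2
        kids_ready = up[l] and up[r] and s[i] == s[l] and s[i] == s[r]
    if i == 0:
        return kids_ready
    f = (i - 1) // 2
    if i % 2 == 1:   # left son: UP_LS = true and S_LS = S_F by convention
        up_ls, s_ls = True, s[f]
    else:            # right son: consult the left sibling
        up_ls, s_ls = up[i - 1], s[i - 1]
    from_above = (not up[f]) and up_ls and s[f] == s_ls and s[i] != s[f]
    if is_leaf:
        return from_above
    return from_above if up[i] else kids_ready


def move(i, up, s):
    """Execute the move of a privileged node i."""
    f = (i - 1) // 2
    if i == 0:
        s[0] = not s[0]               # root flips S
    elif 2 * i + 1 >= len(up):
        s[i] = s[f]                   # leaf copies its father's S
    elif not up[i]:
        up[i] = True                  # internal node passes the privilege up
    else:
        up[i], s[i] = False, s[f]     # internal node accepts the privilege


def stabilize(depth, seed, max_moves=100_000):
    """Start from a random state; return (moves, up, s) once stabilized."""
    rng = random.Random(seed)
    n = 2 ** depth - 1
    s = [rng.random() < 0.5 for _ in range(n)]
    up = [rng.random() < 0.5 for _ in range(n)]
    up[0] = False                     # the root always has UP = false
    for i in range(n // 2, n):
        up[i] = True                  # every leaf always has UP = true
    for moves in range(max_moves):
        if all(up[i] and s[i] == s[0] for i in range(1, n)):
            return moves, up, s
        candidates = [i for i in range(n) if privileged(i, up, s)]
        move(rng.choice(candidates), up, s)
    return None, up, s
```

Once the stabilized state is reached, repeatedly listing the privileged nodes should show exactly one at every step, with the privilege circulating from the root through the left and then the right subtree.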

3.  Correctness  proof  of  the  protocol 

Under normal conditions (i.e., when the system is stabilized), each of the processes in the binary tree takes
its turn to make a move in the order described in the previous section. When the system has more than one
privilege, each move of the root causes a "frontier" of stabilization to expand and diffuse from the root.
Eventually this frontier covers all the leaves and the system is stabilized. The proof proceeds as follows:

Let  G  =  G(V,E)  be  a  binary  tree  with  at  least  three  nodes  representing  the  network  of  processes. 

Definition Let P = <v_0, v_1, ..., v_k> be a path in G. Then P is an up-path if v_0 is the root and ∀ v_i,
1 ≤ i ≤ k, UP_i = true and S_i = ¬S_0.

Definition An up-tree, T, is a subgraph obtained from a snapshot of G taken right after a move by the
root which satisfies the following properties:

(i) root ∈ V(T).

(ii) u ∈ V(T) only if there exists an up-path to u in that snapshot.


Lemma 1 If a father node and a son node have been in the same up-tree once and since then they make
their moves according to the protocol, then from that time onwards, UP_father = true implies
UP_son = true.

Proof Consider the first move (made by one of these two nodes) that results in a state contrary to
that stated in the lemma. Before this move, UP_father = UP_son; no moves that are allowed in
the protocol could have resulted in the aforementioned state. □

Theorem 1 For any two successive moves by the root, if the up-tree, T_1, corresponding to the first move
has V(T_1) ≠ V(G), then the up-tree, T_2, corresponding to the second move, will satisfy V(T_1)
⊂ V(T_2) and |V(T_1)| < |V(T_2)|.

Proof Let v be a node with the shortest distance from the root in G such that v ∈ V(T_1) but v ∉
V(T_2). Let v's father be u. [Observe that v ≠ root since by definition the root is in all
up-trees. Also, u ≠ root; otherwise, by the fact that the root has the privilege to make a second
move, v will be in V(T_2). Finally, v ∈ V(T_1) implies u ∈ V(T_1).] Let S_root = x in T_1. In T_2,
S_root = ¬x since successive moves made by the root always flip S_root. v not in V(T_2) implies
UP_v = false or S_v = S_root (= ¬x) in T_2.

Case 1. UP_v = false. By the choice of v, u ∈ V(T_2). Thus, UP_u = true in T_2. By lemma 1,
UP_v = true. This is a contradiction.

Case 2. S_v = ¬x. Let P be the up-path to v in T_1. In T_1, S_v = ¬x. This implies v has made
an even number of moves made possible by an even number of privileges obtained from
above. The number of moves must be zero, for otherwise v would have executed at least two
privileges from u before the snapshot of T_2 is taken; and since UP_u = UP_v in T_1, this implies
u has executed at least two privileges from its father. Continuing this argument, one of the
sons of the root in P would have executed at least two privileges obtained from the root after
T_1 but before T_2 is formed. This is impossible. Hence, v has executed zero privileges from u
when T_2 is formed. However, u has executed its privilege from above (as witnessed by S_u =
x). At the time S_u = x, UP_u = false; UP_u cannot turn back to true without v's making its
move. This contradicts the fact that UP_u = true as implied by u ∈ V(T_2).

To show |V(T_1)| < |V(T_2)|: Since V(T_1) ≠ V(G), there exists a node w in T_1 such that at
least one of its two sons is not in T_1. For the root to execute its move again after the first
move, S_w must be flipped. When it is flipped, UP_w = false. The only way to make UP_w =
true again is to have both sons of w having their UP = true and S = S_root. The sons will hold
their states until T_2 is formed. Thus, |V(T_1)| < |V(T_2)|. □

Lemma 2 Any node u ≠ root with UP_u = false will eventually have UP_u = true.

Proof By induction on the size of the subtree rooted at u.

For the base, consider a subtree rooted at u with two leaves of G as its sons. Since UP_L and
UP_R are true by definition, it suffices to show that S_L and S_R will equal S_u. (u will then have
the privilege to move, setting UP_u = true.) This follows from the protocol for the leaves.
Now, consider a subtree rooted at u with size larger than three. Let S_u = x. Consider the left
subtree of u rooted at v. (The right subtree can be handled similarly.)

Case 1. S_v = x and UP_v = true. v holds its states until u makes a move by which UP_u =
true.

Case 2. S_v = x and UP_v = false. By induction hypothesis, UP_v will eventually be true.
Observe from the protocol that this move made by v does not affect S_v. Thus S_v = x and UP_v
= true and this becomes case 1.

Case 3. S_v = ¬x and UP_v = true. v has the privilege to move, setting S_v to x and UP_v to
false. Apply case 2's argument.

Case 4. S_v = ¬x and UP_v = false. By induction hypothesis, UP_v = true eventually without
affecting S_v. Apply case 3's argument.

With both sons of u having UP = true and S = S_u, u will make a move, setting UP_u to true.

□

Lemma 3 For any node u ≠ root, if UP_u = true and there are no privileges from its father, the subtree
rooted at u will eventually have no privileges.

Proof By induction on the size of the subtree rooted at u.

Consider a subtree, T, rooted at u with two leaves of G as its sons. Since UP_u, UP_L, UP_R are
true, and given no privileges from u's father, T has no privileges. This establishes the base
of the induction.

Now, consider a subtree rooted at u with size larger than three. Consider the left subtree of u
rooted at v.

Case 1. UP_v = true. v has no privileges from u. By induction hypothesis, the left subtree of
u will eventually have no privileges.

Case 2. UP_v = false. By lemma 2, UP_v will become true. Apply case 1's argument.

Similarly, the right subtree of u will eventually have no privileges. Thus the subtree rooted at
u eventually has no privileges. □

Theorem 2 Within a finite number of moves, G is stabilized.

Proof If the root executes its privileges an infinite number of times, the successive up-trees formed will
eventually cover G by theorem 1. At that point G is stabilized. Otherwise, the root executes
a finite number of moves. We will show that this cannot happen. Consider the last move
made by the root. Let S_root = x after this move. The left son, u, of the root will make a move,
setting UP_u = false and S_u = x. By lemma 2, UP_u will eventually have value true. And
since the root has made its last move, u cannot obtain another privilege from the root. By
lemma 3, the subtree rooted at u eventually has no privileges. Similarly, the right subtree of
the root will eventually have no privileges. At this point, only the root has the privilege to
move, and it will make a move provided it is not infinitely lazy. This contradicts the finite
number of moves made by the root. □

1. Introduction

Topor [4] gives an elegant derivation of a termination detection algorithm; however, his algorithm
implicitly requires that message delays be bounded. Otherwise, the algorithm may declare termination (of a
distributed computation) while some processes are still actively engaged in the computation. We give an
upper bound on the message delays that is necessary and sufficient for Topor’s algorithm to function
correctly in a system of processes where message transmission is not instantaneous as in CSP [2]. Next we
give an improved (termination detection) algorithm that allows arbitrary (but finite) message delays in a
system. Finally, we prove the correctness of the improved algorithm.

Our algorithm is similar to that of Misra [3] in the sense that we use special messages on every
communication channel to flush out messages in transit. However, our algorithm uses a spanning tree instead of a
cycle (consisting of all channels in a system) as is used in [3]. We assume the readers are familiar with the
work of Dijkstra [1] and Topor [4]. (See the appendix for a brief description of Topor’s algorithm.)

2.  Preliminaries 

We assume there is a set of processes cooperating in some computation. The processes are modeled by a
graph G = (V, E) where V represents the set of processes and E the communication channels among the
processes. Processes can be active or idle. Only an active process may send messages to its neighbors in
G. An active process may turn from active to idle at any time, but an idle process can only become active
on the receipt of some message. All messages are received in the same order as they are sent; this applies to
all types of messages. A distributed computation is said to have terminated if all processes are idle and
there are no messages in transit.

This work was supported by a grant from Rome Air Development Center under the Post-Doctoral Fellowship Program (Grant
No. F-30602-81-C-0205).


To establish an upper bound on message delays for Topor’s algorithm to function correctly, we need the
following notation. Let:

T be a fixed spanning tree in G. (The tree used in Topor’s algorithm.)

For any nodes u, v in V,

$l_u$ is the distance (level) of u from the root in T;

$T_u$ is the subtree of T which is rooted at u;

$h_u$ is the height of $T_u$, where $h_u = \max \{ l_w - l_u : w \in V(T_u) \}$.

If u sends a message to v, delay(u,v) is the time between u’s sending and v’s receipt of the message.

$\delta$ (> 0) is the minimum message delay time over all $e \in E$. Thus, $\delta \le$ delay(u,v).

[a, b] with $a \le b$ is the time interval between global time a and global time b, including both a and b in
the interval. Similarly, [a, b) is the interval between a and b, including a but excluding b. (a, b] and (a, b)
are defined analogously.

$R_i$ ($i \ge 1$) is the global time when the root sends the i-th wave of (repeat) signals in Topor’s algorithm or
when it declares termination after (i - 1) waves of signals. By definition, $R_0 = 0$.

$\tau_{u,i}$ is the global time at which u sends a token up to its father for the i-th time.

Note: For $i \ge 1$, $\tau_{u,i} < R_i$, $R_i - \tau_{u,i} \ge l_u\delta$, $R_i < \tau_{u,i+1}$, and $\tau_{u,i+1} - R_i \ge (l_u + 2h_u)\delta$.

The last inequality comes from the fact that u cannot send its token up until it has received a
signal from above (which takes at least $l_u\delta$), passed signals to nodes below (which requires at least
$h_u\delta$) and got new tokens from nodes below (which takes at least $h_u\delta$).

If  before  the  distributed  computation  has  started,  all  nodes  are  white  with  leaf  nodes  having  white  tokens 
and  there  are  no  messages  in  transit,  it  is  possible  for  Topor’s  algorithm  to  declare  termination  without  the 
sending  of  any  (repeat)  signals  by  the  root.  This  can  happen  only  if  no  nodes  have  sent  any  messages 
throughout  the  computation.  In  reality,  this  is  unlikely  to  be  the  case;  some  messages  for  computational 
purposes  are  exchanged  before  termination,  hence  we  assume  without  loss  of  generality  that  the  root  will 
send  at  least  one  wave  of  signals  downwards.  In  the  following  discussion,  we  exclude  the  trivial  case  of  no 
signals being sent by the root and assume that the root will not declare termination at $R_1$, i.e. it sends at
least one wave of signals.

3. Analysis of the bound on message delays for Topor’s algorithm

Lemma 1 For Topor’s algorithm to function correctly in a system of asynchronous communicating
processes, it is necessary that for any distinct $u, v \in V$, delay(u,v) < $(l_u + l_v + 2h_v)\delta$.

Proof Suppose not; then the following scenario can occur. Let $p, q \in V$ be two sons of the root in
T. Pick $u \in V(T_p)$ and $v \in V(T_q)$ with $(u,v) \in E$. u sends a message to v and turns black;
u then immediately sends a black token to its father and paints itself white. Meanwhile, all
other nodes are white and idle and the root has received white tokens from every child
except p. Say the black token sent by u takes exactly $l_u\delta$ time to get to the root. After that,
the root initiates a new wave of signals, since having received a black token forbids it to
declare termination. If the signal reaches v in exactly $l_v\delta$ time and v receives all white tokens
from its sons in $2h_v\delta$ time, then $(l_u + l_v + 2h_v)\delta$ time has elapsed since u’s sending of the
message to v. By assumption, this message has not arrived at v yet. v as well as u can send white
tokens up, causing the root to declare termination. However, v can become active after
receiving the message in transit, invalidating the root’s conclusion about the system. □

To prove the sufficiency of the above bound, we examine the time when the root initiates different waves
of signals. If the root declares termination at time $R_{i+1}$ ($i \ge 1$), we show that for any node u, all messages
sent by u before time $\tau_{u,i}$ are received by the receiving nodes and there are no messages sent by u after
$\tau_{u,i}$.

Lemma 2 For $i \ge 1$, if delay(u,v) < $(l_u + l_v + 2h_v)\delta$ and the root does not declare termination at $R_i$,
then any messages sent before $\tau_{u,i}$ from u to v are received by $\tau_{v,i+1}$.

Proof The proof is by induction on i. For i = 1, the root does not declare termination at $R_1$ since
we assume that the root sends at least one wave of signals. Hence $\tau_{v,2}$ exists. From the Note
in the previous section, $\tau_{v,2} - \tau_{u,1} \ge (l_u + l_v + 2h_v)\delta$. By the fact that u sends the message to v
before $\tau_{u,1}$ and the bound on delay(u,v), v must have received the message by $\tau_{v,2}$. Thus, the
basis of induction is established.

For i = k + 1, assume the root does not declare termination at $R_{k+1}$. By the induction hypothesis,
we only need to consider messages sent by u to v in the interval $[\tau_{u,k}, \tau_{u,k+1})$. Again $\tau_{v,k+2}$ exists
and $\tau_{v,k+2} - \tau_{u,k+1} \ge (l_u + l_v + 2h_v)\delta$. Hence by time $\tau_{v,k+2}$, v should have received all messages
sent by u during $[0, \tau_{u,k+1})$ because of the bound on delay(u,v). □
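
The inequality used in both steps of this proof is the sum of two inequalities from the Note in Section 2,
one applied to the sender u and one to the receiver v. Written out for a general index i (a restatement
for the reader's convenience, not an additional result):

```latex
\begin{align*}
R_i - \tau_{u,i}            &\ge l_u\,\delta, \\
\tau_{v,i+1} - R_i          &\ge (l_v + 2h_v)\,\delta, \\
\tau_{v,i+1} - \tau_{u,i}   &\ge (l_u + l_v + 2h_v)\,\delta .
\end{align*}
```

A message sent by u before $\tau_{u,i}$ whose delay is less than $(l_u + l_v + 2h_v)\delta$ therefore arrives
before $\tau_{v,i+1}$.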

Lemma 3 If the root declares termination at $R_{i+1}$ ($i \ge 1$), then no node u has sent a message in the
interval $[\tau_{u,i}, R_{i+1}]$ and u remains idle and white in $[\tau_{u,i+1}, R_{i+1}]$.

Proof Assume some nodes become active or black in the interval between their last sending of
tokens up and the declaration of termination by the root at $R_{i+1}$. Pick from these nodes a
node u such that it is the earliest (in global time after $R_i$) to be activated. Say u is activated
by some node w. By Lemma 2, w cannot have sent the wake-up message to u before $\tau_{w,i}$.
Also, w cannot have sent the message in $[\tau_{w,i}, \tau_{w,i+1})$ for otherwise the root cannot have
declared termination. Thus w has turned from idle to active in $[\tau_{w,i+1}, R_{i+1})$ before it can send
the message to u. This implies that w is activated at least $\delta$ time before u. This contradicts
the choice of u. □

By ensuring that the root sends at least one wave of signals together with the bound on delay(u,v), Lemmas
2 and 3 guarantee that when the root declares termination, there are no messages in transit and all nodes
are inactive and white. Thus we have:

Theorem 1 Given that the root sends at least one wave of signals, Topor’s algorithm will function
correctly in a system where message transmission is not instantaneous if and only if for any
two distinct nodes u, v, delay(u,v) < $(l_u + l_v + 2h_v)\delta$.
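
As a hedged illustration of Theorem 1, the following Python sketch computes the levels $l_u$ and heights
$h_u$ of a rooted spanning tree and the resulting bound on the delay of a message from u to v. The tree
encoding (a `children` dictionary), the function names and the sample tree are assumptions made for this
example; they are not part of Topor’s algorithm or of the analysis above.

```python
def levels_and_heights(children, root):
    """Return (level, height) dictionaries for a rooted tree.

    `children` maps each node to a list of its sons (assumed encoding).
    level[u]  is the distance of u from the root      (l_u in the text)
    height[u] is the height of the subtree rooted at u (h_u in the text)
    """
    level, height = {root: 0}, {}
    order = [root]
    for u in order:                      # breadth-first traversal
        for c in children.get(u, []):
            level[c] = level[u] + 1
            order.append(c)
    for u in reversed(order):            # sons are processed before fathers
        height[u] = max((height[c] + 1 for c in children.get(u, [])), default=0)
    return level, height


def delay_bound(level, height, u, v, delta):
    """Strict upper bound on delay(u, v) stated in Theorem 1."""
    return (level[u] + level[v] + 2 * height[v]) * delta


# Hypothetical tree: r has sons p and q; p has son u; q has son v; v has son w.
children = {"r": ["p", "q"], "p": ["u"], "q": ["v"], "v": ["w"]}
level, height = levels_and_heights(children, "r")
bound = delay_bound(level, height, "u", "v", delta=1.0)
```

Note that the bound is not symmetric in u and v, because only the height of the receiver's subtree enters it.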

4.  An  improved  termination  detection  algorithm 

Given  G  =  {V ,E),  our  algorithm  will  use  a  fixed  spanning  tree  T .  The  root  of  T  will  be  responsible  for 
termination  detection. 

Definition For any nodes $u, v \in V$ such that $(u,v) \in E$,

(a) u is above v if $l_u < l_v$.

(b) u is below v if v is above u.

(c) u is a sibling of v if $l_u = l_v$.

Our algorithm is a modification of Topor’s algorithm. We shall use four colors to color the nodes: blue,
grey, black and white. Roughly speaking, when the algorithm has been initiated, blue and grey nodes are
those that have seen a new wave of signals but have not sent their tokens up, while black and white nodes
are those that have sent their tokens up. There are also two colors for the token: black and white. A black
token means there are possibly messages in transit and it is not safe to declare termination, while a white
token represents a possibility of no messages in transit from a local point of view of some node. Our
algorithm consists mainly of alternately sending a wave of signals from the root downwards, propagating to the
leaves, and a wave of tokens upwards in the reverse direction, changing the color of nodes if necessary. If
the root has received white tokens from all nodes below and is idle and blue, then it will declare
termination.

Initially, i.e. before the computation has started, all nodes are white, there are no nodes with tokens or
signals and there are no messages in transit. The algorithm is informally given as a set of rules.

For initiating the algorithm:

Rule  1  When  the  root  is  idle,  it  sends  signals  on  all  edges  to  nodes  below  and  paints  itself  blue.  (This 
rule  is  applied  only  once.) 

On receiving a token:

Rule  2  A  node  on  receiving  a  black  token  from  below  paints  itself  grey.  Otherwise,  there  are  no  changes 
to  its  color. 

For  sending  tokens  upwards  : 

Rule 3 A node u other than the root (u may be a leaf node) sends tokens to all nodes above when:

(a)  it  is  idle  and 

(b)  it  has  received  tokens  from  all  nodes  below  it  (this  is  trivially  satisfied  for  a  leaf  node)  and 

(c)  it  has  received  signals  from  all  its  siblings. 

The color of u and the color of the tokens u sends are as follows:

If u is blue, it sends out white tokens and paints itself white.¹

If u is grey, it sends out black tokens and paints itself black.

After sending out tokens, u loses its tokens.

For normal signal sending:

Rule 4 After the root has received all tokens from below and is idle, if it is blue, it declares termination.²
Otherwise, it loses all its tokens, sends signals to nodes below and paints itself blue.


¹ In this case, all tokens received by u must be white by Rule 2.

² In this case, all tokens received by the root must be white by Rule 2.

Rule 5 When an internal node other than the root has received signals from all nodes above, it waits
until it is idle, sends signals to its siblings and all nodes below and paints itself blue.

Rule  6  A  leaf,  on  receiving  signals  from  all  nodes  above,  waits  until  it  is  idle,  sends  signals  to  its  siblings 
and  paints  itself  blue. 

For  sending  messages : 

Rule 7 When any node $u \in V$ sends a message along any of its incident edges, it changes color as
follows:

If u is white, it changes to black.

If u is blue, it changes to grey.

Otherwise, there are no changes to u’s color.
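
The color changes prescribed by Rules 2, 3 and 7 form a small state machine. The Python sketch below
encodes only those three rules; the class and function names are assumptions made for illustration, and
raising an error for a node that is neither blue nor grey in `on_send_tokens` is an assumption of this
sketch, since Rule 3 lists only those two cases.

```python
from enum import Enum


class Color(Enum):
    WHITE = "white"
    BLACK = "black"
    BLUE = "blue"
    GREY = "grey"


class Token(Enum):
    WHITE = "white"
    BLACK = "black"


def on_receive_token(color, token):
    """Rule 2: a black token from below turns the node grey."""
    return Color.GREY if token is Token.BLACK else color


def on_send_tokens(color):
    """Rule 3: return (color of the tokens sent upwards, new node color)."""
    if color is Color.BLUE:
        return Token.WHITE, Color.WHITE
    if color is Color.GREY:
        return Token.BLACK, Color.BLACK
    raise ValueError("Rule 3 specifies only the blue and grey cases")


def on_send_message(color):
    """Rule 7: sending a computation message darkens white and blue nodes."""
    if color is Color.WHITE:
        return Color.BLACK
    if color is Color.BLUE:
        return Color.GREY
    return color
```

For example, a blue node that sends a message becomes grey, and its next token is therefore black, which
prevents the root from declaring termination in the current wave.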

5.  Correctness  Proof 

We need to show that when the root declares termination, all nodes in V are white and idle. If this is the
case, we claim there are no messages in transit. If there is a message sent by u in transit, either it was sent
before the last time u turned blue, in which case the message should have been flushed out by the signals (if
the receiving node is a sibling of u or is below u) or tokens (if the receiving node is above u) that u sent, or the
message was sent after u turned blue, in which case u is black when the root declares termination, which is
impossible. Hence it suffices to establish the following for the above algorithm:

Theorem  2  When  the  root  declares  termination,  all  nodes  in  V  are  white  and  idle  in  the  system. 

Proof Suppose not; then there exists some node u that is not white or is active. If u is
blue or grey, then the root could not have declared termination, for u has not sent out its
tokens yet. If u is white and active, it must have received a message from a grey or black
node w that wakes up u. If w is grey right after sending this message, the root could not
have declared termination because w has seen the last wave of signals and will send black
tokens up. Thus, w is black (right after sending this message). If w was grey before it turned
black, again the root could not have declared termination. Hence, w must have changed from
blue to white to black. Among all nodes that have changed from blue to white to black,
consider the node x whose change from white to black is the earliest since the last time
the root sent signals. Since x had been white, all its neighbors have seen the last wave of
signals from the root. All messages sent by x’s neighbors to x before they turned blue the last
time were received by x when x turned white. Hence the node y that wakes up x must have
sent the wake-up message to x after it turned blue but before it sent its token up (this follows
from the choice of x), i.e. y changed its color from blue to grey to black. Again the root
could not have declared termination. □


It  can  easily  be  seen  that  if  the  distributed  computation  has  ended,  the  above  algorithm  will  eventually 
declare  termination. 

Appendix: Topor’s Algorithm

Given a graph G = (V, E), this algorithm will make use of a fixed spanning tree T in G. Initially, each leaf
has a token.


Rule 0 An idle leaf that has a token transmits a token to its parent; an idle internal node that has received
a token from each of its children transmits a token to its parent; an active node does not transmit
a token. When a node transmits a token, it is left without any tokens.

Rule 1 A node sending a message becomes black.

Rule 2 A node that is black or has a black token transmits a black token, otherwise it transmits a white
token.

Rule 3 A node transmitting a token becomes white.

Rule 4 If the root has received a token from each of its children, and it is active or black or has a black
token, it becomes white, loses its tokens, and sends a repeat signal to each of its children. If the root
is white, idle and tokens received from all its children are white, it declares termination.

Rule  5  An  internal  node  receiving  a  repeat  signal  transmits  the  signal  to  each  of  its  children. 

Rule  6  A  leaf  receiving  a  repeat  signal  is  given  a  white  token. 

1. Introduction

With the increase in the number of processing elements, computational power and complexity of
interconnection, the reliability and fault tolerance of distributed systems have become areas of acute
concern. A distributed system, like any other digital system, is subject to faults which make it deviate from its
specified behavior; unlike other systems, however, most distributed systems come with a promise to deliver
reliability and high availability of resources even in pathological cases of processor failures. Indeed, this is
one of the main reasons for having distributed systems at all. To support the claims of fault-tolerance and
reliability, such systems need to have the ability to diagnose faults when they occur and initiate recovery
procedures.

A self-diagnosing system is one in which faults can be isolated to within replaceable parts of the
system. As noted in [Kime80], fault diagnosis in distributed systems is generally a 2-step procedure: it involves
both the detection of a fault when it occurs and its location, before repair or graceful degradation may be
initiated. In this report, our principal concern is with the latter. We refine the notion of fault location in later
sections, but it may be pointed out that the purpose of fault location is to identify a superset of the units
which must definitely be faulty to cause the syndrome which afflicts the system.

Several models have been proposed for fault diagnosis in distributed systems. In this report, we
restrict ourselves to just one model and its offshoots. We illustrate the salient features of this model with
examples and explore the interesting theoretical contributions of this model. Finally, we identify a host of
unsolved problems with our own preliminary results and map the course of future research to be done in
this area.

2.  The  PMC  model 

In the two decades since the introduction of the so-called PMC model [Prep67], significant progress
has been made in the development of the theory associated with the model. The reasons for this are not
difficult to find: first, the simplicity and elegance of this model have made it appealing to theoretical
scientists and system designers alike. Second, the universality of the model makes it suitable for capturing
the essence of diagnosis in distributed systems. Third, the model permits a level of abstraction where
faultiness of a processor/software module is reduced to being just 2-valued: faulty or non-faulty. While this may
not be the best picture of "real-world" processors and modules, which are susceptible to a fuzzy behavior
between the two extremes of being faulty and non-faulty, this simplification has the beneficial effect of
enabling diagnosis algorithms developed for the PMC model to be applicable in all layers of the system
architecture. Indeed, one may drop the distinctions between hardware processors and software modules and
just talk of units when modeling the system for the purpose of fault diagnosis.

A distributed system is one in which computational tasks are performed by multiple units. There are
2 principal assumptions which are made in the PMC model about the distributed system being modeled:
first, the variables which characterize the fault behavior of units (Mean Time Between Failure, Mean Time
To Repair, inter alia) are assumed to be independent random variables; second, the units are assumed to
have the capability to administer tests among themselves for the purpose of diagnosis. We concern
ourselves neither with the precise nature of these tests nor with how and when they are actually administered,
beyond insisting that they be complete, i.e. a fault-free unit should always correctly identify the units it
tests as being faulty or fault-free.

The system that is to be diagnosed is partitioned into logical units. These units need not be similar in
their functionality within the distributed system, except in their ability to test, singly or in combination,
another one of the units. The outcome of the tests may be classified simply as "pass" or "fail", indicating
that the testing unit evaluates the tested unit as being fault-free or faulty, respectively. We assume that the
evaluation is meaningful only if the testing unit itself is fault-free; otherwise the outcome is unreliable. This
assumption is known as the symmetric invalidation assumption and forms the basis of many offshoots of
the PMC model. An alternative assumption, which is also frequently used (and considered by some to be
more representative of real systems and real tests) is the asymmetric invalidation assumption, wherein a
"pass" outcome necessarily implies that the tested unit is fault-free, but a "fail" outcome only implies that
either the tested unit or the testing unit or both are faulty. The rationale behind this assumption
[Bars76, Kime80] is that a complete test in systems composed of complex units entails the checking of a
large number of responses from the tested system, rendering it extremely unlikely that the faults in the units
performing the test would completely cancel the faults in the unit under test, causing a test to pass when it
should have failed.

The test system is modeled as a directed graph G(V, E), with the units represented by vertices in the
graph and the tests by directed edges. Thus if <i, j> is a directed edge from unit $u_i$ to unit $u_j$, the unit $u_i$
tests the unit $u_j$. This directed graph is called the connection assignment of the system.

Weights are assigned to edges depending on the test outcomes: the weight $a_{ij}$ associated with the
edge <i, j> is defined as follows:

$$a_{ij} = \begin{cases} 0 & \text{if } u_i \text{ tests } u_j \text{ with outcome pass} \\ 1 & \text{if } u_i \text{ tests } u_j \text{ with outcome fail} \end{cases}$$

The set of test outcomes $a_{ij}$ is called the syndrome of the system. There are $2^{|E|}$ possible
syndromes for any connection assignment.

We  use  an  example  to  motivate  the  discussion  and  definitions  which  follow. 


Example  1 

Consider a system composed of 5 units $u_1, u_2, \ldots, u_5$ whose connection assignment is in the form of a ring. A
syndrome for the system can be represented as a 5-bit vector $(a_{12}, a_{23}, a_{34}, a_{45}, a_{51})$.
Assume exactly one of the units, say $u_1$, is faulty. Then

$a_{23} = a_{34} = a_{45} = 0$; $a_{51} = 1$,

i.e. $u_5$ correctly identifies $u_1$ as being faulty, and

$a_{12} = x$, i.e. 0 or 1,

since $u_1$, being faulty, may or may not diagnose $u_2$ properly.

Thus the syndrome for exactly one of the units being faulty can be only of the form

(x0001)

or one of its cyclic permutations. Moreover, it can easily be proved that the connection assignment is capable of
identifying the faulty unit when exactly one unit is faulty. □
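
The reasoning of Example 1 can be checked mechanically. The following Python sketch, written under the
symmetric invalidation assumption, enumerates the fault sets that are consistent with a given syndrome.
The function name `consistent_fault_sets`, the dictionary encoding of the syndrome and the choice x = 1
are assumptions made for this illustration.

```python
from itertools import combinations


def consistent_fault_sets(units, syndrome, max_faults):
    """List every fault set of size <= max_faults consistent with `syndrome`.

    `syndrome` maps a test edge (i, j) to its outcome: 0 = pass, 1 = fail.
    Only outcomes reported by fault-free testers constrain the fault set;
    outcomes reported by faulty testers are arbitrary.
    """
    result = []
    for size in range(max_faults + 1):
        for candidate in combinations(units, size):
            faulty = set(candidate)
            if all(outcome == (1 if j in faulty else 0)
                   for (i, j), outcome in syndrome.items()
                   if i not in faulty):
                result.append(faulty)
    return result


units = [1, 2, 3, 4, 5]
# Ring u1 -> u2 -> u3 -> u4 -> u5 -> u1 with syndrome (x0001), taking x = 1.
syndrome = {(1, 2): 1, (2, 3): 0, (3, 4): 0, (4, 5): 0, (5, 1): 1}
single = consistent_fault_sets(units, syndrome, max_faults=1)
double = consistent_fault_sets(units, syndrome, max_faults=2)
```

With at most one fault allowed, `single` contains only the set {1}. If two faults are allowed, `double`
additionally contains {1, 2}, so the syndrome alone no longer determines the fault set.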

The example raises the question of whether the modeled system is capable of identifying more than 1
fault. This question is ambiguous on 3 counts, since the system’s "capability to identify more than one fault"
is open to 3 different interpretations:

First, it could mean the capability of locating up to t faults (t > 1) instantly, i.e. with just one
syndrome. Note that an upper bound on the number of faults is necessary since the entire set of units, V, is
always a consistent fault set for any possible syndrome. Moreover, if a system has the ability to identify fault
sets unequivocally, this necessarily implies that the number of faulty units is less than $n/2$, where n is the
number of units.

Second, it could mean the capability to identify at least 1 fault if the number of faulty units does not
exceed t. In this case, we would be able to identify the remaining faulty units after replacing the faulty unit
with a non-faulty one and repeating the test. This might involve as many repetitions of the test as the size of
the fault set, and therefore represents an approach which is a compromise between having a small number
of test links and performing a small number of tests.

Third, the word "identification" could be interpreted as meaning "locating within a set which
contains the units sought". Thus, we could be questioning the system’s ability to isolate the set of faulty units
to be within a larger set, for we could then replace the larger set with impunity, knowing that all the faulty
units have been replaced, albeit along with some fault-free units. This represents the system designer’s
willingness to sacrifice a few good units for the sake of getting a fast response time in the fault detection and
location phase before initiating system repair or graceful degradation.

All three interpretations are equally valid and give rise to 3 different diagnosability measures of a
system.

Definition  1 

A system of n units is one-step t-fault diagnosable (or simply t-diagnosable) if all faulty units
within the system can be identified without replacement provided the number of faulty units present does
not exceed t.

Definition  2 

A  system  of  n  units  is  sequentially  t-diagnosable  if  at  least  one  faulty  unit  can  be  identified  without 
replacement  provided  the  number  of  faulty  units  present  does  not  exceed  t . 

Definition  3 

A system of n units is k-step t/s-diagnosable if, by no more than k applications of the diagnostic
procedure, a set of size not more than s units can be identified, such that all faulty units are within the set,
provided the number of faulty units present does not exceed t.

If a system is 1-step t/s-diagnosable, we say simply that it is t/s-diagnosable.

Example  2 

Consider the system of Example 1 again. The syndrome (x0001) is compatible with 2 possible fault sets:
$\{u_1, u_2\}$ and $\{u_1\}$. Therefore the system is not 2-diagnosable. However, it is sequentially 2-diagnosable, as we
demonstrate below.

Since there are no more than 2 faults in the system, there must always be two fault-free processors adjacent to
each other. So there is definitely some edge with weight 0, no matter what the syndrome is. Assume, without loss of
generality, that $a_{1,2}$ is 0.

Case 1 Suppose that $a_{5,1}$ is the only link with a 1 weight. Then, if $u_1$ is assumed to be fault-free, we have to conclude
that $u_2, u_3, \ldots, u_5$ are all fault-free. But then $a_{5,1} = 1$ implies that $u_1$ is faulty, which is a contradiction.
Hence, in this case $u_1$ must be faulty.

Case 2 Suppose that $a_{5,1}$ is also a link with a 0 weight. Then if $u_2$ is assumed to be faulty, we have to conclude that
both $u_1$ and $u_5$ are also faulty. But then we have more than 2 units which are faulty, which contradicts the
assumption that there are at most 2 faulty units in the system. Hence, $u_2$ must be fault-free and we just have to
follow the links after $u_2$ around the ring until we come to a link with a weight of 1. The unit that this link points
to must be faulty.

Case 3 Suppose that $a_{5,1}$ is not the only link with a 1 weight. Then consider each of the following possible sub-cases,
distinguished by the 5-bit vector $X = (a_{5,1}, a_{1,2}, \ldots, a_{4,5})$. In each of the sub-cases the assumption that not
more than 2 processors are faulty immediately leads to the identification of a faulty processor.

X = (10001): Then $u_5$ must be faulty.

X = (10010): Then $u_4$ must be faulty.

X = (10011): Then $u_4$ must be faulty.

X = (10100): Then $u_1$ must be faulty.

X = (10101): Then $u_3$ must be faulty.

X = (10110): Then $u_3$ must be faulty.

X = (10111): Then $u_3$ must be faulty. □
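
The case analysis of Example 2 can also be confirmed by exhaustive search. The Python sketch below is a
brute-force check for the 5-unit ring under symmetric invalidation: a unit is certainly faulty if it belongs to
every fault set of size at most t that is consistent with the syndrome. All names (`RING`, `consistent_sets`,
`sequentially_diagnosable`) are assumptions of this illustration, and the search is only practical for very
small systems.

```python
from itertools import combinations, product

UNITS = [1, 2, 3, 4, 5]
RING = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]   # (tester, tested) pairs


def is_consistent(faulty, syndrome):
    """Symmetric invalidation: every fault-free tester reports the truth."""
    return all(outcome == (1 if j in faulty else 0)
               for (i, j), outcome in syndrome.items() if i not in faulty)


def consistent_sets(syndrome, t):
    """All fault sets of size <= t that could have produced `syndrome`."""
    candidates = (set(c) for k in range(t + 1) for c in combinations(UNITS, k))
    return [f for f in candidates if is_consistent(f, syndrome)]


def sequentially_diagnosable(t):
    """True if every possible syndrome pins down at least one faulty unit."""
    for outcomes in product([0, 1], repeat=len(RING)):
        syndrome = dict(zip(RING, outcomes))
        consistent = consistent_sets(syndrome, t)
        if consistent in ([], [set()]):
            continue  # syndrome impossible, or explained only by "no faults"
        if not set.intersection(*consistent):
            return False
    return True


# Sub-case X = (10010): a51 = 1, a12 = 0, a23 = 0, a34 = 1, a45 = 0.
sub_case = {(5, 1): 1, (1, 2): 0, (2, 3): 0, (3, 4): 1, (4, 5): 0}
certain = set.intersection(*consistent_sets(sub_case, 2))
```

For the ring, `sequentially_diagnosable(2)` holds while `sequentially_diagnosable(3)` does not, and `certain`
is {4} for the sub-case shown, in agreement with the table above.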

3.  Main  Results  and  Open  Problems 

There are three main kinds of problems in the distributed system diagnosis environment as
envisaged by the PMC model, each of which is discussed below with a summary of the main results.

3.1.  The  Characterization  problem 

It is natural to ask what kind of system architecture or interconnection strategy is to be adopted
to design a distributed system having a certain diagnosability. Since we have three different
diagnosability measures, this question then is really 3 different characterization problems:

1. The t-characterization problem: Given a certain t (i.e. an upper bound on the number of
faulty units), what are necessary and sufficient conditions for a distributed system to be
t-diagnosable?

Note that this problem (and as a matter of fact, the other 2 problems below) can be answered in purely graph-theoretical terms, which is yet another vindication of the PMC model. A system designer can assure himself of a certain amount of diagnosability if a complete characterization is possible, for then he only has to ascertain that the connection assignment for his system satisfies the requirements of diagnosability.

The first complete t-characterization appeared in [Haki74]. We give below a simpler version ([Sull84, Alla75]):

A directed graph G(V,E) is t-diagnosable if and only if
$\forall Z \subseteq V \; [\, Z \neq \emptyset \Rightarrow \frac{|Z|}{2} + |\Gamma^{-1}Z| > t \,]$,
where $\Gamma^{-1}Z$ denotes the set of units outside Z which test at least one unit in Z.

An application of this theorem reveals the following facts about t-diagnosable systems: first, the number of units must be at least 2t+1, else the system cannot be t-diagnosable (take Z = V, which has no testers outside it); second, each unit must be tested by at least t other units, else certain scenarios of t-fault situations will remain undiagnosable (take Z to be a single unit).

Designs for t-diagnosable systems which are symmetric (i.e., the interconnection strategy is the same for every processor) and optimal (i.e., using the minimum number of testing links) have been achieved in the so-called $D_{\delta t}$ systems ([Prep67]).

2.  The  sequential  t -characterization  problem:  Given  a  certain  t,  what  are  necessary  and 
sufficient  conditions  for  a  distributed  system  to  be  sequentially  t  -diagnosable? 

The first paper on the PMC model introduced the concept of sequential diagnosability and gave very weak necessary conditions. It also dealt with the special class of single-loop systems and gave a complete characterization for sequential diagnosability in such systems. (Interestingly enough, at the time the paper was published, the authors did not realize that their characterization of single-loop systems was complete. In a separate paper [Prep68], Preparata demonstrated that the "sufficient" conditions were, in fact, necessary also.) Since then, the sequential t-characterization problem has been solved for other special classes of systems (see, for example, [Karu79]); however, as far as can be determined, this problem still remains open for general connection assignments.

3. The t/s-characterization problem: Given a certain t and s, what are necessary and sufficient conditions for a distributed system to be t/s-diagnosable?

This is another problem for which there appears to be no published solution for arbitrary connection assignments. Again, characterizations for special classes of systems have been achieved. For example, Karunanithi and Friedman have addressed single-loop systems in [Karu79]; Chwa and Hakimi have characterized the so-called systems and the restricted class of systems which are t/t-diagnosable ([Chwa81]); Sullivan, in [Sull84], has given (without a proof) a characterization of t/t+1-diagnosable systems (which, because of a typographical error, is actually incorrect as stated). It also appears that no significant work has been done in designing optimal or sub-optimal arbitrary t/s-diagnosable systems, although it must be mentioned again that for special classes of systems, this issue has been settled satisfactorily (see, for example, [Chwa81]).

3.2.  The  diagnosability  problem 

The flip side of the characterization problems are the corresponding diagnosability problems, one for each diagnosability measure defined above. Instead of asking for the characterization of a distributed system which supports a certain diagnosability measure, here we assume that the interconnection strategy for diagnosis (i.e., the connection assignment) has already been given to us and want to determine the maximum diagnosability that it can support.

t-diagnosability: Allan et al. [Alla75] introduced the concept of the diagnosability number, which is the largest number of faults, t, that a system with a given connection assignment G(V,E) can tolerate and still remain t-diagnosable. For about a decade or so, only algorithms with time complexities exponential in n, the number of units in the system, were known for determining the diagnosability number for arbitrary connection assignments; recently, Sullivan ([Sull84]) used network flow techniques to obtain a remarkable O(n^) algorithm, thus scotching the suspicion that a polynomial time algorithm did not exist.

Sequential t-diagnosability: The corresponding algorithm for determining the sequential t-diagnosability number for arbitrary connection assignments has not appeared in the published literature so far. In fact, it is not even known whether this problem is tractable (in terms of polynomial time solvability) or not.

t/s-diagnosability: Sullivan, in his landmark paper [Sull84] on a polynomial time algorithm for t-diagnosability, also proved that t/s-diagnosability is co-NP-complete, i.e., the following decision problem

Given a directed graph G(V,E) and positive integers t and s, is G t/s-diagnosable?

cannot be answered in time polynomial in n, unless the old debate of whether P equals NP ([Gare79]) is settled affirmatively. Actually, negative results like this abound in the diagnosis area (see, for example, [Mahe76, Fuji78] and [Soma86]); nevertheless, the theoretical importance of such results cannot be gainsaid. Polynomial time algorithms have been achieved for special classes of t/t-diagnosable systems: in [Sull84], for example, is an algorithm for t/t-diagnosable systems, which can be generalised to yield efficient algorithms for t/t+k-diagnosable systems as long as k is much smaller than t.

3.3. The diagnosis problem

Even when a system has a demonstrable amount of diagnosability, there still remains the practical problem of diagnosing faults when they occur. This is perhaps the area of utmost concern to the system designer and, surprisingly, is the area where there are the fewest results. Again, we examine the main results in diagnosis algorithms under the 3 measures of diagnosability:

t-diagnosis: Given a directed graph G(V,E) which is t-diagnosable, and a syndrome in a t-fault situation, find the unique fault set consistent with the syndrome.

The above problem, which is as old as the PMC model itself, eluded all attempts at an efficient solution for 17 years. Kameda et al. [Kame75] gave an O(n^) algorithm for it, but as pointed out in [Corl76], the algorithm was flawed by a technical error, which nevertheless does not render it unusable ([Madd77]). Only very recently ([Dahb84]) did Dahbura and Masson come up with a really elegant $O(n^{2.5})$ algorithm which uses the concepts of maximal matching and minimum vertex covers in undirected graphs. As such, all 3 problems (characterization, diagnosability and diagnosis) for the t-diagnosability measure can be considered to be completely solved.

Sequential t-diagnosis: Given a directed graph G(V,E) which is sequentially t-diagnosable, and a syndrome in a t-fault situation, find at least one unit which is definitely faulty, regardless of which set of units, among all the possible consistent fault sets for the syndrome, actually caused the syndrome.

As mentioned earlier, the characterization and diagnosability problems for this class of directed graphs are still unsolved in the literature. Unfortunately, the same is true of diagnosis. Except for the restricted classes of single-loop and $D_{\delta t}$ systems ([Prep67] and [Karu79] respectively), no attempt has been made to solve the sequential t-diagnosis problem for systems with arbitrary connection assignments.

t/s-diagnosis: Given a directed graph which is t/s-diagnosable, and a syndrome in a t-fault situation, locate a set X of cardinality no greater than s, such that every consistent fault set for the syndrome is a subset of X.

Here again, no decent algorithm is known, although Yang et al. ([Yang86]) gave an $O(n^{2.5})$ algorithm for t/t-diagnosable systems, using ideas from [Dahb84]. Also, Chwa and Hakimi ([Chwa81]) have given an optimal algorithm for Dx,s,t systems (which are restricted versions of t/t-diagnosable systems). So, except for the co-NP-completeness result of Sullivan, little is known about general t/s-diagnosable systems.

The following table summarizes the state of affairs in the fault-diagnosis area:


Diagnosability   Problem type
measure          Characterization        Diagnosability          Diagnosis
---------------  ----------------------  ----------------------  ----------------------
t                Solved                  Solved                  Solved

Seq. t           Open; but solved for    Open; but solved for    Open; but solved for
                 special graphs          special graphs          special graphs

t/s              Open; but solved for    Solved: generally       Open; but solved for
                 special graphs          intractable             special graphs

4.  Our  results  and  directions  for  future  research 

We have used concepts from set partitioning to solve the characterization problems mentioned in the previous section and which are still considered open in the literature [Ragh86]. The characterization reveals some very interesting and heretofore unknown properties of t/t-diagnosable systems. We hope to take advantage of these properties to generalise the earlier work ([Dahb84] and [Yang86]) to produce an efficient algorithm for t/t+k-diagnosable systems. We are in the process of writing up our results for publication in the IEEE Transactions on Computers.

The main thrust of our immediate research will be in solving the other open problems mentioned in the previous section. There are two other areas in fault diagnosis of distributed systems which we consider worth investigating and are part of our agenda. First, the pioneering work of Masson, Mallela, Dahbura and Yang [Mall78, Dahb84, Yang86] in modeling intermittent faults has opened up a fresh and very practical area for future work. Second, the problem of fault diagnosis in distributed systems with dynamic failure and repair, which are not so easily modeled using the PMC method, deserves attention. One weakness of the PMC model is the underlying assumption that the test results are available simultaneously to a "global observer" who is an unmodeled entity not subject to faults. This runs counter to fundamental principles of distributed computing, where there is no fault-free, omniscient supervisor. Only very recently has research been aimed at producing systems in which diagnosis is performed by the modeled units themselves ([Holt85, Hoss84]) and the outlook is promising.

1. Introduction

1.1.  The  PMC  model 

In the two decades since the introduction of the so-called PMC model [Prep67], significant progress has been made in the development of the theory associated with the model. The reasons for this are not difficult to find: first, the simplicity and elegance of this model have made it appealing to both theoretical scientists and system designers alike. Second, the universality of the model makes it suitable for capturing the essence of diagnosis in a variety of distributed systems as well as VLSI systems. Third, the model permits a level of abstraction where the behavior of processors/software modules is limited to just two states: faulty or non-faulty. While this may not be the best picture of "real-world" processors and modules, which are inclined to a more fuzzy behavior between the two extremes of being faulty and non-faulty, this simplification has the beneficial effect of enabling diagnostic algorithms developed for the PMC model to be applicable in all layers of the system architecture. Indeed, one may drop the distinctions between hardware processors and software modules and speak only of units when modeling the system for fault diagnosis.

A distributed or a multiprocessor system is one in which computational tasks are performed by multiple units. In the PMC model, two principal assumptions are made about the system: first, the variables that characterize the fault behavior of units (Mean Time Between Failure and Mean Time To Repair, inter alia) are assumed to be independent random variables; second, the units are assumed to have the capability to administer tests among themselves for diagnosis. We concern ourselves neither with the precise nature of these tests nor with how and when they are administered, beyond insisting that they be complete, i.e., a fault-free unit should always correctly identify the units it tests as being faulty or fault-free.

The system that is to be diagnosed is partitioned into logical units. These units need not be similar in their functionality within the distributed system, except in being able to test, singly or in combination, another unit. The outcome of the tests may be classified simply as "pass" or "fail", indicating that the testing unit evaluates the tested unit as being fault-free or faulty, respectively. We assume that the evaluation is significant only if the testing unit itself is fault-free, otherwise the outcome is unreliable. This assumption is known as the symmetric invalidation assumption and forms the basis of many offshoots of the PMC model. An alternative assumption, which is also frequently used (and considered by some to represent real systems more closely), is the asymmetric invalidation assumption, wherein a "pass" outcome necessarily implies that the tested unit is fault-free, but a "fail" outcome only implies that either the tested unit or the testing unit or both are faulty. The rationale behind this assumption [Bars76, Kime80] is that a complete test in systems composed of complex units entails the checking of many responses from the tested system. Therefore, it is extremely unlikely that the faults in the units performing the test would completely cancel the faults in the unit under test, causing a test to pass when it should have failed.

The test system is modeled as a directed graph G(V,E), with the units represented by vertices in the graph and the tests by directed edges. Thus if (i,j) is a directed edge from unit $u_i$ to unit $u_j$, the unit $u_i$ tests the unit $u_j$. This directed graph is called the connection assignment of the system. A more or less natural measure of the "diagnosability" of a connection assignment is the following:

A system of n units is one-step t-fault diagnosable (or simply t-diagnosable) if all faulty units within the system can be identified without replacement provided the number of faulty units present does not exceed t.

The diagnosability problem, then, is to identify the largest t for which a given connection assignment remains t-diagnosable. Sullivan [Sull84], in a remarkable tour de force, presented the first polynomial time algorithm for the diagnosability problem. In what follows, we develop some concepts which, though not completely used in our present version of our algorithm, are likely to improve the complexity
even  further.  In  a  forthcoming  report,  we  use  some  of  the  concepts  presented  here  to  get  an  0(]  E|  ^  V] 
algorithm  for  t  -diagnosability. 
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1.2.  Preliminaries:  Definitions  and  notation 


For a given connection assignment, represented by a digraph G(V,E), the diagnosability number $\tau_G$ is the largest non-negative integer t for which G is t-diagnosable. Where the digraph is clear from the context, we may drop the subscript in $\tau_G$. A total function $\sigma: E \to \{0,1\}$ is said to be a syndrome for G. The intuitive idea behind this definition is that since every directed edge represents a "test" in the system being modeled, the test results (or the "syndrome") must simply be a collection of "passes" and "fails", each edge having exactly one of the 2 possibilities. In our notation, a 0 signifies a pass and a 1 a fail.

$F \subseteq V$ is a consistent fault set for the syndrome $\sigma$ if neither

(a) $\sigma(u,v) = 0$ where $u \in V-F$ and $v \in F$, nor

(b) $\sigma(u,v) = 1$ where $u,v \in V-F$

holds.

$\mathcal{F}_{\sigma} = \{F : F \text{ is a consistent fault set for } \sigma\}$.

$\mathcal{F}_{\sigma}^{t} = \{F : F \in \mathcal{F}_{\sigma}, \; |F| \le t\}$.

The definition of $\mathcal{F}_{\sigma}^{t}$ allows the following observations: first, a syndrome $\sigma$ occurs in a t-fault situation if and only if $\mathcal{F}_{\sigma}^{t} \neq \emptyset$. Second, the earlier definition of the diagnosability measure may be restated as follows:

G(V,E) is t-diagnosable if and only if for every syndrome $\sigma$ for G in a t-fault situation, $|\mathcal{F}_{\sigma}^{t}| = 1$.
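
The restated definition can be applied literally on very small digraphs by enumerating all syndromes. The sketch below does this under the symmetric invalidation assumption; it is exponential in the number of edges and is intended only to make the definition concrete. The function names are hypothetical, and units are assumed to be numbered 0..n-1.

```python
from itertools import combinations, product

def consistent(fault_set, edges, syndrome):
    # Conditions (a) and (b): a test by a fault-free unit u must report
    # 1 if v is in F and 0 otherwise; tests by faulty units are unconstrained.
    for (u, v), outcome in zip(edges, syndrome):
        if u in fault_set:
            continue
        if outcome != int(v in fault_set):
            return False
    return True

def fault_sets_up_to(n, t):
    # All candidate fault sets of cardinality at most t.
    return [frozenset(c) for k in range(t + 1) for c in combinations(range(n), k)]

def t_diagnosable_by_definition(n, edges, t):
    edges = sorted(edges)              # fix an edge order for the syndrome tuple
    candidates = fault_sets_up_to(n, t)
    for syndrome in product((0, 1), repeat=len(edges)):
        matches = [f for f in candidates if consistent(f, edges, syndrome)]
        # No match: the syndrome does not occur in a t-fault situation.
        # More than one match: the syndrome is ambiguous.
        if len(matches) > 1:
            return False
    return True

if __name__ == "__main__":
    loop5 = {(i, (i + 1) % 5) for i in range(5)}
    print(t_diagnosable_by_definition(5, loop5, 1))
    print(t_diagnosable_by_definition(5, loop5, 2))
```

Syndromes with no consistent fault set of size at most t are skipped, since they cannot arise in a t-fault situation.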

Let $v \in V$ be any vertex of G. $d_{in}(v)$ and $d_{out}(v)$ denote, respectively, the in- and out-degree of the vertex v. The set of vertices of G from which there are directed edges to v is denoted by $\Gamma^{-1}v$, i.e.,

$\Gamma^{-1}v = \{u : u \in V \;\&\; (u,v) \in E\}$.

Also,

$\Gamma v = \{u : u \in V \;\&\; (v,u) \in E\}$.

Let $X \subseteq V$ be some set of vertices; then

$\Gamma^{-1}X = \bigcup_{v \in X} \Gamma^{-1}v - X$,

$\Gamma X = \bigcup_{v \in X} \Gamma v - X$.

When we are dealing with more than one digraph, we sometimes use the notations $\Gamma_G^{-1}$ and $\Gamma_G$ to avoid ambiguity. The operators $\Gamma^{-1}$ and $\Gamma$ take precedence over union and intersection. Thus

$\Gamma^{-1}X \cap Y = (\Gamma^{-1}X) \cap Y$.

A tournament $T_n$ on n vertices is a digraph in which every pair of vertices u,v contributes exactly one of (u,v) and (v,u) to the set of edges, i.e., there are no directed cycles of length 2 and the total number of edges is $\frac{n(n-1)}{2}$. A weighted tournament is a tournament in which every vertex is assigned a positive weight.

2.  Properties  of  minimal  bottleneck  sets 

The  following  theorem  provides  a  completely  graph-theoretical  characterization  of  t  -diagnosability. 
For  a  proof,  see  the  referenced  papers. 

Theorem  1  [Sull84,  Alla75] 

G(V,E) is t-diagnosable if and only if $\forall Z \subseteq V \; [\, (Z \neq \emptyset) \Rightarrow \frac{|Z|}{2} + |\Gamma^{-1}Z| > t \,]$. □

Let $\tau$ be the diagnosability number of G. Theorem 1 implies that there exists a non-empty set $Z \subseteq V$ such that $\lceil \frac{|Z|}{2} \rceil + |\Gamma^{-1}Z| = \tau + 1$. This motivates the following definition.

Definition A bottleneck set of G(V,E) is a set $Z \subseteq V$ which satisfies $\lceil \frac{|Z|}{2} \rceil + |\Gamma^{-1}Z| = \tau + 1$, where $\tau$ is the diagnosability number of G. A minimal bottleneck set is a bottleneck set, no proper subset of which is also a bottleneck set. The bottleneck function, $\Phi_G$, of a set $X \subseteq V$ is defined by $\Phi_G(X) = \lceil \frac{|X|}{2} \rceil + |\Gamma^{-1}X|$.

When the context permits no ambiguity, we will drop the subscript in $\Phi_G$.


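Theorem 2 The cardinality of a minimal bottleneck set is either 1 or even.

Proof - Suppose not. Let $Z \subseteq V$, for some digraph G(V,E), be a minimal bottleneck set of odd ($\ge 3$) cardinality. Let $v \in Z$ be any vertex. Then $Z' = Z - \{v\}$ satisfies

$\Phi(Z') = \lceil \frac{|Z'|}{2} \rceil + |\Gamma^{-1}Z'| \le \lceil \frac{|Z|}{2} \rceil + |\Gamma^{-1}Z| = \Phi(Z)$,

since removing v lowers the first term by one and adds at most the vertex v itself to the second. Thus Z' is also a bottleneck set, which contradicts the minimality of Z. □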


Theorem 3 Let Z be a minimal bottleneck set of G(V,E). Then the subgraph induced by Z is strongly connected.

Proof - Suppose not. Then there is a strongly connected component Z' in the subgraph induced by Z, which is a proper subset of Z and has no arcs entering it from within Z. Therefore,

$\Phi(Z') = \lceil \frac{|Z'|}{2} \rceil + |\Gamma^{-1}Z'| \le \lceil \frac{|Z'|}{2} \rceil + |\Gamma^{-1}Z| \le \lceil \frac{|Z|}{2} \rceil + |\Gamma^{-1}Z| = \Phi(Z)$,

which is a contradiction. □
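
Theorems 2 and 3 can be observed on small digraphs by listing the minimal bottleneck sets exhaustively. The sketch below assumes the bottleneck function is the half-cardinality of X rounded up plus the number of testers of X outside X, and that the diagnosability number is one less than the minimum of this function over non-empty sets. The function names are illustrative, and the enumeration is exponential in the number of units.

```python
from itertools import combinations

def phi(edges, x):
    # Bottleneck function: ceil(|X|/2) + |Gamma^{-1} X|.
    x = set(x)
    outside_testers = {u for (u, v) in edges if v in x and u not in x}
    return (len(x) + 1) // 2 + len(outside_testers)

def minimal_bottleneck_sets(n, edges):
    subsets = [frozenset(c) for k in range(1, n + 1)
               for c in combinations(range(n), k)]
    best = min(phi(edges, s) for s in subsets)        # equals tau + 1
    bottlenecks = [s for s in subsets if phi(edges, s) == best]
    # Keep only bottleneck sets with no bottleneck set as a proper subset.
    minimal = [s for s in bottlenecks if not any(b < s for b in bottlenecks)]
    return best - 1, minimal

def strongly_connected(edges, s):
    # Check that the subgraph induced by s is strongly connected.
    s = set(s)

    def reach(start, forward):
        seen, stack = {start}, [start]
        while stack:
            a = stack.pop()
            for (u, v) in edges:
                if not forward:
                    u, v = v, u
                if u == a and v in s and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    root = next(iter(s))
    return reach(root, True) == s and reach(root, False) == s

if __name__ == "__main__":
    mutual = {(0, 1), (1, 0)}                  # two units testing each other
    tau, mins = minimal_bottleneck_sets(2, mutual)
    print(tau, [sorted(m) for m in mins])
```

On two units that test each other, the only minimal bottleneck set is the whole pair, which has even cardinality and is strongly connected, in line with both theorems.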


In what follows, we show that a minimal bottleneck set satisfies a property which is stronger than strong connectivity. We call this property "collapsibility".

Definition Z is collapsible iff

(a) $|Z| = 1$, or

(b) Z is the union of 2 collapsible sets X and Y such that $\Gamma^{-1}X \cap Y \neq \emptyset$ and $\Gamma^{-1}Y \cap X \neq \emptyset$.
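
The recursive definition translates directly into a small checker. The sketch below reads condition (b) as: some unit of Y outside X tests a unit of X, and some unit of X outside Y tests a unit of Y. That reading, the factory function `make_collapsible`, and the exhaustive search over pairs of proper subsets are assumptions of this illustration; the search is exponential and only suitable for a handful of vertices.

```python
from functools import lru_cache
from itertools import combinations

def make_collapsible(edges):
    edges = frozenset(edges)

    def tests_into(x, y):
        # True when (Gamma^{-1} X) intersected with Y is non-empty.
        return any(u in y and u not in x and v in x for (u, v) in edges)

    @lru_cache(maxsize=None)
    def collapsible(z):
        if len(z) == 1:
            return True
        items = sorted(z)
        proper = [frozenset(c) for k in range(1, len(items))
                  for c in combinations(items, k)]
        for x in proper:
            if not collapsible(x):
                continue
            for y in proper:
                if (x | y == z and collapsible(y)
                        and tests_into(x, y) and tests_into(y, x)):
                    return True
        return False

    # Accept any iterable of vertices for convenience.
    return lambda z: collapsible(frozenset(z))

if __name__ == "__main__":
    chain = make_collapsible({(0, 1), (1, 0), (1, 2), (2, 1)})
    cycle = make_collapsible({(0, 1), (1, 2), (2, 0)})
    print(chain({0, 1, 2}), cycle({0, 1, 2}))
```

A directed 3-cycle is strongly connected but has no pair of mutually testing units, so under this reading it is not collapsible; this is the sense in which collapsibility is stronger than strong connectivity.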

We need a few lemmas to prove that a minimal bottleneck set is collapsible.

Lemma 1:- $\forall n$, $n \ge 1$, there exists a tournament $T_n$ on n vertices which satisfies:

(a) If n is odd, $\forall v \in V(T_n)$: $d_{in}(v) = d_{out}(v) = \frac{n-1}{2}$.

(b) If n is even, there exist $\frac{n}{2}$ vertices v such that $d_{in}(v) + 1 = d_{out}(v) = \frac{n}{2}$ and $\frac{n}{2}$ vertices v such that $d_{in}(v) = d_{out}(v) + 1 = \frac{n}{2}$.

Proof:- By induction. As basis, observe that (a) holds trivially for n = 1. Assume that (a) and (b) hold for all $j \le n$.

Case 1 n+1 is even. Then n is odd. Build $T_{n+1}$ as follows. By the induction hypothesis, there exists a tournament $T_n$ on n vertices which satisfies (a). Add a vertex $v_{n+1}$ to this tournament. Connect $v_{n+1}$ to $\frac{n-1}{2}$ vertices of $T_n$ and from the remaining $\frac{n-1}{2}+1$. Then the resulting graph has $\frac{n+1}{2}$ vertices of in-degree $\frac{n-1}{2}$ and out-degree $\frac{n+1}{2}$, and $\frac{n+1}{2}$ vertices with in-degree $\frac{n+1}{2}$ and out-degree $\frac{n-1}{2}$, which satisfies (b).

Case 2 n+1 is odd. By the induction hypothesis, we have a tournament $T_n$ which satisfies (b). Build $T_{n+1}$ by connecting an extra vertex $v_{n+1}$ to all the vertices of $T_n$ which have in-degree of $\frac{n}{2}-1$ and from all the vertices which have out-degree of $\frac{n}{2}-1$. The resulting tournament on n+1 vertices satisfies (a), since all the vertices now have in- and out-degree of $\frac{n}{2}$. □

[Figure: $T_n$ for even n, showing $\frac{n}{2}$ vertices of in-degree $\frac{n}{2}-1$ and out-degree $\frac{n}{2}$, and $\frac{n}{2}$ vertices of in-degree $\frac{n}{2}$ and out-degree $\frac{n}{2}-1$.]

Definition  A  balanced  tournament  is  a  tournament  which  satisfies  (a)  and  (b)  of  Lemma  1. 
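
The inductive proof of Lemma 1 is constructive, and the construction can be sketched in a few lines. The code below follows the two cases of the proof, adding one vertex at a time; vertices are numbered from 0, and the choice of which vertices the new vertex points to in the odd case is arbitrary (here, the lowest-numbered ones). The function names are illustrative.

```python
def balanced_tournament(n):
    # Returns the edge set of a balanced tournament on vertices 0..n-1.
    edges = set()
    for m in range(1, n):              # extend T_m by the new vertex m
        indeg = [0] * m
        for (_, v) in edges:
            indeg[v] += 1
        if m % 2 == 1:
            # T_m is regular: the new vertex points to (m-1)/2 vertices
            # and is pointed to by the remaining (m-1)/2 + 1.
            targets = set(range((m - 1) // 2))
        else:
            # The new vertex points to every vertex of in-degree m/2 - 1
            # and is pointed to by every vertex of out-degree m/2 - 1.
            targets = {v for v in range(m) if indeg[v] == m // 2 - 1}
        for v in range(m):
            edges.add((m, v) if v in targets else (v, m))
    return edges

def degree_profile(n, edges):
    # Sorted list of (in-degree, out-degree) pairs, one per vertex.
    indeg, outdeg = [0] * n, [0] * n
    for (u, v) in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return sorted(zip(indeg, outdeg))

if __name__ == "__main__":
    for n in range(1, 7):
        print(n, degree_profile(n, balanced_tournament(n)))
```

Recomputing in-degrees at every step keeps the sketch short at the cost of quadratic extra work, which is irrelevant at the sizes used in the proofs.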

Lemma 2:- Let $T_n$ be a tournament on n vertices. Then,

(a) If n is odd, $\exists v \in V(T_n) \; [\, \frac{1}{2} + |\Gamma^{-1}v| \le \frac{n}{2} \,]$.

(b) If n is even, $\exists v \in V(T_n) \; [\, \frac{1}{2} + |\Gamma^{-1}v| \le \frac{n-1}{2} \,]$.

Proof:-

$\sum_{v \in V(T_n)} |\Gamma^{-1}v| = |E(T_n)| = \frac{n(n-1)}{2}$

$\Rightarrow \exists v \in V(T_n) \; [\, |\Gamma^{-1}v| \le \frac{n-1}{2} \,]$.

If n is even, the inequality above is strict and the lemma follows. □

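Lemma 3:- Let $T_n$ be a weighted tournament on n vertices with weight function $\omega: V(T_n) \to Z^{+}$, and let $\omega(S)$ denote the total weight of the vertices in a set S. Then,
$\exists v \in V(T_n) \; [\, \frac{\omega(v)}{2} + \omega(\Gamma^{-1}v) \le \frac{\omega(V)}{2} \,]$.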
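
Before the proof, the statement of Lemma 3 can be sanity-checked on random instances. The sketch below is only an empirical check under the reading of the lemma given above (weights are positive integers, and the weight of a set is the sum of its members' weights). The generator, the seed, and the function names are assumptions made for the illustration.

```python
import random

def random_weighted_tournament(n, max_weight, rng):
    # Orient every pair of vertices at random and draw positive integer weights.
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            edges.add((u, v) if rng.random() < 0.5 else (v, u))
    weights = [rng.randint(1, max_weight) for _ in range(n)]
    return edges, weights

def lemma3_witness(n, edges, weights):
    # Look for v with w(v)/2 + w(Gamma^{-1} v) <= w(V)/2.
    # The inequality is doubled so that only integers are compared.
    total = sum(weights)
    for v in range(n):
        in_weight = sum(weights[u] for (u, x) in edges if x == v)
        if weights[v] + 2 * in_weight <= total:
            return v
    return None

if __name__ == "__main__":
    rng = random.Random(1)
    edges, weights = random_weighted_tournament(6, 5, rng)
    print(weights, lemma3_witness(6, edges, weights))
```

A returned value of `None` would be a counterexample to the lemma; the check is a plausibility test, not a substitute for the proof that follows.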

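Proof:- By contradiction. Let $T_n$ be a tournament with weight function $\omega$ such that $\forall v \in V(T_n) \; [\, \frac{\omega(v)}{2} + \omega(\Gamma^{-1}v) > \frac{\omega(V)}{2} \,]$. Let $M = \omega(V)$. Number the vertices $v_1, v_2, v_3, \ldots, v_n$ and let their weights be, respectively, $a_1, a_2, a_3, \ldots, a_n$. Then we have $\sum_{i=1}^{n} a_i = M$ and

$\forall i \; [\, \frac{a_i}{2} + \omega(\Gamma^{-1}v_i) > \frac{M}{2} \,]$ (A)

Build a tournament $T_{2M}$ on 2M vertices

$v_{1,1}, v_{1,2}, v_{1,3}, \ldots, v_{1,a_1}, v_{1,a_1+1}, \ldots, v_{1,2a_1}$

$v_{2,1}, v_{2,2}, v_{2,3}, \ldots, v_{2,a_2}, v_{2,a_2+1}, \ldots, v_{2,2a_2}$

$\vdots$

$v_{n,1}, v_{n,2}, v_{n,3}, \ldots, v_{n,a_n}, v_{n,a_n+1}, \ldots, v_{n,2a_n}$

with edges defined by the following two rules:

(a) If $i \neq j$, then $\forall k,l$, $1 \le k \le 2a_i$, $1 \le l \le 2a_j$: $(v_{i,k}, v_{j,l}) \in E(T_{2M})$ iff $(v_i, v_j) \in E(T_n)$.

(b) $\forall i$, $1 \le i \le n$, add edges to $E(T_{2M})$ so that the subgraph induced by the row $v_{i,1}, v_{i,2}, \ldots, v_{i,2a_i}$ is a balanced tournament on $2a_i$ vertices.

By Lemma 2(b), $\exists v \in V(T_{2M}) \; [\, \frac{1}{2} + |\Gamma_{T_{2M}}^{-1}v| \le M - \frac{1}{2} \,]$. Without loss of generality, let $v_{1,1}$ be a vertex which satisfies this inequality.

Then,

$|\Gamma_{T_{2M}}^{-1}v_{1,1}| \ge 2\omega(\Gamma_{T_n}^{-1}v_1) + a_1 - 1$

$\Rightarrow 2\omega(\Gamma_{T_n}^{-1}v_1) + a_1 - 1 \le M - 1$

$\Rightarrow \omega(\Gamma_{T_n}^{-1}v_1) + \frac{a_1}{2} \le \frac{M}{2}$

which contradicts (A). □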

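Theorem 4:- Every minimal bottleneck set of G is collapsible.

Proof:- Suppose not. Let $X \subseteq V(G)$ be a minimal bottleneck set which is not collapsible. Then X can be expressed as the union of $m \ge 2$ disjoint collapsible sets $X_1, X_2, \ldots, X_m$ such that

$\forall i,j \; [\, i \neq j \Rightarrow \Gamma^{-1}X_i \cap X_j = \emptyset \text{ or } \Gamma^{-1}X_j \cap X_i = \emptyset \,]$.

Build a weighted tournament $T_m$ on m vertices $v_1, v_2, \ldots, v_m$ with weight function $\omega$ as follows:

$\omega(v_i) = |X_i|$, $1 \le i \le m$.

The edges of this tournament, $E(T_m)$, are built as follows:

For every pair of vertices, $v_i$ and $v_j$, exactly one of $(v_i,v_j)$ or $(v_j,v_i)$ must belong to $E(T_m)$. If $\Gamma^{-1}X_i \cap X_j \neq \emptyset$, then let $(v_j,v_i)$ belong to $E(T_m)$, else if both $\Gamma^{-1}X_i \cap X_j$ and $\Gamma^{-1}X_j \cap X_i$ are empty, let $(v_i,v_j)$ belong to $E(T_m)$ if and only if $i < j$. Clearly, the edges as defined above will produce a tournament on m vertices.

By Lemma 3,

$\exists v \in V(T_m) \; [\, \frac{\omega(v)}{2} + \omega(\Gamma_{T_m}^{-1}v) \le \frac{\omega(V(T_m))}{2} \,]$

$\Rightarrow \exists i \; [\, \frac{|X_i|}{2} + |\Gamma^{-1}X_i| \le \frac{|X|}{2} + |\Gamma^{-1}X| \,]$

$\Rightarrow \exists i, \; 1 \le i \le m \; [\, \Phi(X_i) \le \Phi(X) \,]$

which contradicts the minimality of X. □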

Corollary [Haki74] If G(V,E) is a digraph with no directed cycles of length 2, then the diagnosability number of G is the minimum in-degree of any vertex in G.

Proof:- If there are no directed cycles of length 2, then the largest collapsible set has cardinality 1. Therefore, every minimal bottleneck set consists of a single vertex and the bottleneck function for minimal bottlenecks is minimized when a vertex of minimum in-degree is chosen. □

Note 1:- In the proof of the above theorem, we are generous in adding edges to get a tournament on m vertices even when the original graph may not have had such edges. This raises the question of whether the theorem can be improved to produce the following claim:

A minimal bottleneck set is strongly collapsible, where strong collapsibility is defined by:

Definition $X \subseteq V(G)$ is strongly collapsible iff

(a) $|X| = 1$, or

(b) X can be partitioned into 2 strongly collapsible sets Y and Z such that $\exists y \in Y, \exists z \in Z \; [\, (y,z),(z,y) \in E(G) \,]$.

That a minimal bottleneck set need not be strongly collapsible is seen by the following example:

[Figure: example digraph; the surviving labels read "Minimal bottleneck set" and "vertex complete".]

Here, the unique minimal bottleneck set {A, B, C, D} is not strongly collapsible since the only partition into two strongly collapsible sets does not satisfy (b) of the definition. □

Note 2:- Can Theorem 4, however, be strengthened as follows?

Let Z be a minimal bottleneck set and let $v \in Z$ be some vertex in Z. Then there is a collapsing sequence starting with v which collapses Z, where we define a collapsing sequence as follows:

Definition  A  collapsing  sequence  is  a  sequence  of  m  vertices  vi,V2,V3,  ■  •  ■  ,v„  such  that 
Vj  2Si<m  V,  G  {vi,V2,  ■  ■  •  ,v,_i )  and  I^'v,  {vi,V2,  •  •  ■  ,Vi_i )  *  0.  Here  m  is  the  length  of  the  col¬ 

lapsing  sequence.  We  say  that  a  vertex  v  gZ  collapses  Z  if  there  is  a  collapsing  sequence  starting  with  v 
of  length  I  Z|  which  consists  only  of  vertices  in  Z. 
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It  should  be  clear  that  not  all  collapsible  sets  need  have  the  property  that  they  are  collapsible  from 
every vertex contained in them. In fact, one can quite easily build collapsible sets which do not contain even a single vertex from which the entire set may be collapsed. Unfortunately, it turns out that minimal bottleneck sets may have vertices from which they cannot be collapsed, as the following example illustrates:


In the above digraph, the minimal bottleneck set is again unique, and consists, in fact, of the entire set of vertices. However, there is no collapsing sequence starting with A which collapses the set. □

We do succeed in proving (Theorem 6) a weaker form of the claim envisaged in the above note, viz., that a minimal bottleneck set consists of at least one vertex from which it may be collapsed. Theorem 5 below is a refinement of Theorem 3. We show that the distance between any 2 vertices in a minimal bottleneck set must be quite small.

Definition The distance from vertex u to vertex v, denoted by dist(u,v), is the length of the shortest directed path from u to v. If no path exists, then dist(u,v) = $\infty$.
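
The distance just defined is what a breadth-first search computes on an unweighted digraph. The sketch below is a standard BFS; the edge-set representation as (tail, head) pairs and the function name `dist` are choices made for this illustration.

```python
from collections import deque
from math import inf

def dist(edges, u, v):
    # Length of the shortest directed path from u to v, or inf if none exists.
    successors = {}
    for (a, b) in edges:
        successors.setdefault(a, []).append(b)
    depth = {u: 0}
    queue = deque([u])
    while queue:
        a = queue.popleft()
        if a == v:
            return depth[a]
        for b in successors.get(a, []):
            if b not in depth:
                depth[b] = depth[a] + 1
                queue.append(b)
    return inf

if __name__ == "__main__":
    loop5 = {(i, (i + 1) % 5) for i in range(5)}
    print(dist(loop5, 0, 3), dist(loop5, 3, 0))
```

Because the graph is directed, dist(u,v) and dist(v,u) generally differ, as the single-loop example shows.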
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Theorem  5:- 


Let $u,v \in Z$, where $Z \subseteq V(G)$ is a minimal bottleneck set. Then $dist(u,v) < 1 + \log_2(2\tau + 1)$, where $\tau$ is the diagnosability number of G.

Proof:- Let $C_i = \{x : x \in Z \;\&\; dist(x,v) = i\}$. Thus $C_0 = \{v\}$. Let $Y_i$ be defined by
$Y_0 = \Gamma^{-1}v - Z$


and  for  i  >  0, 


(Look  at  picture  below.) 


Let  />  be  the  least  number  such  that  C*  ^  0-  Then  Z  =  and  r"‘Z  =  . 

By  the  minimality  of  Z .  we  get  the  following  h  inequalities: 

iCd 


ICouCil 

- 2 — 


+lY(i  +1  Cil  >  t  +  1 

+  N  +1  Yil  +1  U  >  t+1 


>  T+  1 


Substituting  x  +  1  = 
terms,  we  get 


If 


+  1  I^*Z|  =i  +1 1,^1 1  in  above  inequalities  and  rearranging 


I  a  > 


I  a  > 


T 


ic*i  > 


\^\ 

~T 


+  1  Y*l 


Since  a  > 


b 

T 


+  c  =>2a>h+2c,we  have. 


1  U  >  I  yC.1  +  2| 
I  Cd  >  I  +  2| 


I  0.1  >  2(  Y*l 
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Unfolding  the  recurrence  inequalities  backwards  and  allowing  for  |  7,|  =  0,  1</  <h ,  we  have, 


ICaI  =ic*i 

1  C*-il  S  I  C*l  + 1 
ICa-J  >  21C*1  +2 
IC*-3l  ^2^Ca|  +3 


ICil  >  2*^0.!  +(h-l) 

ICd  =1  {v)|  =  1 

Since 

2t  =  +lr-'z|  -1) 

5  I  Z|  +  2|  r-'Z|  -  2 
2|Zl  -2 

S  2M  CaI  +l+^^^'^-2 

we  get 


h  <  10g2(  ^-)  +  1 

Since]  CaI  -  1,  we  have  the  desired  inequality.  □ 
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Definition  X  c,V (G )  is  sequentially  collapsible  if  there  exists  a  vertex  veX  which  collapses  X. 

We  need  the  following  lemma  to  prove  that  a  minimal  bottleneck  set  is  sequentially  collapsible. 

Lemma 4 Let $G(V,E)$ be a digraph such that $V = \bigcup_{i=1}^{m} X_i$, where each of the $X_i$'s is neither empty nor equal to $V$ and

$$\forall i\ 1 \le i \le m \quad \Gamma^{-1}X_i \cap \Gamma X_i = \emptyset \qquad (A)$$

holds.

Then for some $X_i$, $1 \le i \le m$,


3Y^i  [y*0&  <I>(T)< 


4 


1 


Proof The proof is by induction on the cardinality of $V$. As basis, observe that if $|V| = 1$, the antecedent of the lemma can never be satisfied since it is impossible to find a non-empty subset of a 1-element set which is not equal to the set. Therefore, the lemma is vacuously true in this case. Assume that it is true whenever $|V| \le n$. We must show that it holds for $|V| = n+1$.

Case 1 $n+1$ is odd.

Consider the subgraph $G_1(V_1,E_1)$ induced by deleting some vertex $v \in V$. Now $|V_1| = |V| - 1 = n$ is even. If there exists some $X_i$, $1 \le i \le m$, such that $X_i = V_1$, then certainly

$$\Phi_G(X_i) = \frac{n}{2} + |\Gamma^{-1}_G X_i| \le \frac{n}{2} + 1$$

and the lemma is proved by letting $Y = X_i$.

If there is no such $X_i$, then since $V_1 = \bigcup_{i} X_i'$ where $X_i' = X_i - \{v\}$, and since (A) holds in the induced subgraph as well, by the induction hypothesis there exists a non-empty subset $Y$ of some $X_i'$ such that


^gSY)= 


4 


+1  rc;;y|  < 


V, 


n 

T 


But then,

$$\Phi_G(Y) \le \Phi_{G_1}(Y) + 1 \le \frac{n}{2} + 1 = \left\lceil \frac{n+1}{2} \right\rceil$$

and the lemma holds.
Case 2 $n+1$ is even.

(This case is much more difficult than the previous case and the proof is in 4 parts. In the first part, we show that any given digraph which satisfies the antecedent of the lemma can be converted to an equivalent digraph in which an additional key property is satisfied. This property provides the inequality necessary to identify at least one subset $Y$ which will prove the lemma. In part 2, we define an equivalence relation which partitions the given digraph into disjoint equivalence classes of vertices; the rest of the proof is directed toward demonstrating that one of these equivalence classes is, in fact, the $Y$ needed to complete the proof. With this in mind, we transform the digraph so that every vertex in any equivalence class $A_k$ has exactly the same in- and out-neighbors from other equivalence classes as the other vertices in $A_k$; the subgraph induced by the equivalence class itself is made a balanced tournament. This has the effect of setting up a correlation between the in-degree of every vertex in the equivalence class $A_k$, and $\Phi(A_k)$. In part 3, we prepare the ground for the final pigeonholing argument in part 4, which will demonstrate that there must be some vertex which has a small enough in-degree for the equivalence class to which it belongs to satisfy the requirement of the lemma.)

Part  1 

In this part, we show how the given graph $G$ may be modified to obtain a new graph in which some key properties are true. More precisely, we have the following claim:


Claim 1

Given a graph $G$ which satisfies the antecedent of the lemma, there exists a graph $G'(V',E')$ which satisfies:

(a) $V' = V$

(b) $V' = \bigcup_{i=1}^{m'} X_i'$, where each of the $X_i'$ is a non-empty set not equal to $V'$, and $m' \ge m$, and

$\forall i\ 1 \le i \le m'\ \ \exists j\ 1 \le j \le m\ \ [X_i' \subseteq X_j]$.

(c) $\forall i\ 1 \le i \le m'\ \ [\Gamma^{-1}_{G'}X_i' \cap \Gamma_{G'}X_i' = \emptyset]$.

(d)  Vi  1^-^'  VTcX'i  = 

and

(e) $\forall i,j$ if $\Gamma_{G'}X_j' \cap X_i' \neq \emptyset$ then $|\Gamma^{-1}_{G'}X_j' \cap X_i'| > |\Gamma_{G'}X_i' \cap X_j'|$.
Proof of Claim 1 Note that (a)-(d) are satisfied by setting $X_i' = X_i$, $1 \le i \le m$, and $m' = m$. The only new requirement is (e). If it is already satisfied in $G$, we are done. So suppose that for some pair of sets $X_i, X_j$, $1 \le i,j \le m$, (e) does not hold. Therefore $\Gamma_G X_j \cap X_i \neq \emptyset$ and $|\Gamma^{-1}_G X_j \cap X_i| \le |\Gamma_G X_i \cap X_j|$. (Look at figure below.)


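[Figure: the sets $X_i$ and $X_j$, showing the subsets $\Gamma^{-1}_G X_j \cap X_i$ and $\Gamma_G X_j \cap X_i$ of $X_i$, and the subsets $\Gamma_G X_i \cap X_j$ and $\Gamma^{-1}_G X_i \cap X_j$ of $X_j$.]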


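Clearly, $(\Gamma^{-1}_G X_j \cap X_i) \cap (\Gamma_G X_j \cap X_i) = \emptyset$ and $(\Gamma_G X_i \cap X_j) \cap (\Gamma^{-1}_G X_i \cap X_j) = \emptyset$, else (A) of the lemma is not satisfied for $G$.

Let $X_{i1} = \Gamma^{-1}_G X_j \cap X_i$ and $X_{i2} = \Gamma_G X_j \cap X_i$. Let $f: X_{i1} \to \Gamma_G X_i \cap X_j$ be any one-one mapping. Such a mapping is possible since $|X_{i1}| \le |\Gamma_G X_i \cap X_j|$.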


Now, for every vertex $v \in X_{i2}$, delete all directed edges from vertices $u \in X_{i1}$ and replace each deleted edge by a new edge from $f(u)$, which is in $\Gamma_G X_i \cap X_j$.

Call the graph thus obtained $H_1$. Let $m_1 = m + 1$. In $H_1$, $\Gamma_{H_1}X_{i1} \cap X_{i2} = \emptyset$, so (e) is satisfied between $X_{i1}$ and $X_{i2}$. Similarly, (e) is satisfied between $X_j$ and $X_{i1}$, and between $X_j$ and $X_{i2}$.


It can be verified that (a)-(d) are still satisfied for all the $m_1$ sets in $H_1$.

If $X_i, X_j$ was the only pair of sets which violated (e) in $G$, then we are done by setting $G' = H_1$ and $m' = m_1$. Else, we repeat the above transformation for some pair of sets in $H_1$ for which (e) is not satisfied to obtain $H_2$ and so on. The sequence of graphs $G = H_0, H_1, H_2, \ldots$ obtained by these repeated transformations is finite and the last graph $G'$ in the sequence satisfies all the properties (a)-(e). □

Note that it suffices to prove case 2 in the transformed graph $G'$ because of properties (b) and (d).

For this reason, and in the interests of avoiding avoidable superscripts, we will suppose without loss of generality that the given graph $G$ satisfies properties (a)-(e).

Part  2 

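Define an equivalence relation $\equiv$ on $V$ as follows:

$u \equiv v$ if and only if for every $i$, $1 \le i \le m$, either both $u$ and $v$ belong to $X_i$ or both $u$ and $v$ do not belong to $X_i$, and, in addition, if $u$ and $v$ do belong to $X_i$ then

$$\forall j\ 1 \le j \le m\ \ \big[(j \neq i) \Rightarrow \big((\Gamma_G u \cap X_j \neq \emptyset \Rightarrow \Gamma^{-1}_G v \cap X_j = \emptyset)\ \&\ (\Gamma_G v \cap X_j \neq \emptyset \Rightarrow \Gamma^{-1}_G u \cap X_j = \emptyset)\big)\big].$$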

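That $\equiv$ is an equivalence relation may be easily verified. Being an equivalence relation, $\equiv$ partitions $V$ into a finite number of equivalence classes $A_1, A_2, A_3, \ldots$ which are pairwise disjoint and do not span set boundaries, i.e., $\forall k\ \forall i,j\ (i \neq j)\ (A_k \cap X_i \neq \emptyset\ \&\ A_k \cap X_j \neq \emptyset \Rightarrow A_k \subseteq X_i \cap X_j)$.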

Build a new graph $G_1(V_1,E_1)$ from $G$ using the following rules:-

(a) $V = V_1$.

(b) The edges in $E_1$ are built according to one of 2 rules:

(b1) Let $A_k, A_l$ be any 2 distinct equivalence classes induced by $\equiv$ in $V$. For every pair of vertices $u,v$ such that $u \in A_k$ and $v \in A_l$, let $(u,v) \in E_1$ if and only if $\Gamma^{-1}_G A_l \cap A_k \neq \emptyset$.

(b2) Add enough edges to $E_1$ so that the subgraph induced by every equivalence class $A_k$ is a balanced tournament.

Note that $G_1$ as constructed above preserves properties (a), (b), (c), and (e) of claim 1, but (d) may be violated.
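
Rule (b2) can be illustrated with a short sketch. The definition of a balanced tournament is not restated in this part of the report, so the code below assumes it means a tournament in which the in-degree and out-degree of every vertex differ by at most one. The function name `balanced_tournament` and the rotational construction are choices made for this example, not the authors' method.

```python
def balanced_tournament(vertices):
    """Return directed edges of a tournament on `vertices` in which every
    vertex's in-degree and out-degree differ by at most 1 (assumed meaning
    of 'balanced')."""
    vs = list(vertices)
    n = len(vs)
    edges = set()
    # Rotational part: vertex i points to the next (n - 1) // 2 vertices cyclically.
    for i in range(n):
        for k in range(1, (n - 1) // 2 + 1):
            edges.add((vs[i], vs[(i + k) % n]))
    # For even n, the antipodal pairs {i, i + n/2} are still unoriented.
    if n % 2 == 0:
        for i in range(n // 2):
            edges.add((vs[i], vs[i + n // 2]))
    return edges

def in_degrees(vertices, edges):
    """Map each vertex to its in-degree."""
    deg = {v: 0 for v in vertices}
    for _, v in edges:
        deg[v] += 1
    return deg

print(in_degrees(range(5), balanced_tournament(range(5))))
```

For an odd number of vertices every in-degree equals $(|A_k| - 1)/2$; for an even number the in-degrees take the two values $|A_k|/2 - 1$ and $|A_k|/2$. Exactly one edge joins each pair of vertices, so no 2-cycle is created inside an equivalence class.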

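Claim 2

Let $A_k$ be any equivalence class induced by $\equiv$ in $V$ and let $v \in A_k$ be some vertex. Let $d^{in}_{G_1}(v)$ denote the in-degree of $v$ in $G_1$. Then $d^{in}_{G_1}(v) \ge \Phi_G(A_k) - 1$.


Proof of claim 2


$$d^{in}_{G_1}(v) = |\Gamma^{-1}_{G_1}v| = |\Gamma^{-1}_{G_1}v \cap A_k| + |\Gamma^{-1}_{G_1}v \cap \overline{A_k}|$$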


Now]  TgIv  P)A*1  ^ 


|A*| 


-  1  by  lemma  1  and]  Fclv  p^A*|  =rc*A*  by  the  construction  (bl) 


above.  Therefore,  the  claim  follows.  □ 


Claim 3

Let $u,v \in V$ be any pair of vertices. Then there exists a 2-cycle between $u$ and $v$ in $G_1$ only if both the conditions below are satisfied:

(i) For some $X_i$, $1 \le i \le m$, both $u$ and $v$ belong to distinct equivalence classes, say $A_k$ and $A_l$, in $X_i$. Moreover, then there is a 2-cycle between every pair of vertices $x,y$ where $x \in A_k$ and $y \in A_l$.

(ii) There exists some $X_j$ such that either $u \in \Gamma^{-1}_{G_1}X_j$ and $v \in \Gamma_{G_1}X_j$ or $u \in \Gamma_{G_1}X_j$ and $v \in \Gamma^{-1}_{G_1}X_j$.

Proof of (i): If $u$ and $v$ belong to distinct sets, $X_i$ and $X_j$, then property (c) of claim 1 is violated for both $X_i$ and $X_j$. If $u$ and $v$ belong to the same equivalence class $A_k$, then construction (b2) ensures that there is only one directed edge between $u$ and $v$. Therefore $u$ and $v$ must belong to distinct equivalence classes, say $A_k$ and $A_l$, which are both contained in the same set, say $X_i$. Moreover, all $x \in A_k$ are equivalent under $\equiv$ to $u$, and all $y \in A_l$ are equivalent under $\equiv$ to $v$. By the construction (b1) therefore, there is a 2-cycle between $x$ and $y$ if and only if there is one between $u$ and $v$.

Proof of (ii): By (i), $u$, $v$ belong to the same set $X_i$, but not to the same equivalence class. Therefore, (ii) follows from the definition of the relation $\equiv$. □


Part 3

The aim of this part is to establish that $|E_1| \le \frac{n(n+1)}{2}$. There seems to be no straightforward pigeonholing argument to prove this and we resort to another construction:

For every pair $u,v$ of vertices in $V_1$ such that a 2-cycle exists between $u$ and $v$, color one edge, say $(u,v)$, red and the other edge $(v,u)$ blue. Now repeat the following algorithm for all pairs of sets $X_i, X_j$, $1 \le i < j \le m$ in $G_1$.

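Step 1:

Delete all red edges $(u,v)$ such that one of the following is satisfied:

i $u \in \Gamma_{G_1}X_j \cap X_i$ and $v \in \Gamma^{-1}_{G_1}X_j \cap X_i$.

ii $u \in \Gamma^{-1}_{G_1}X_j \cap X_i$ and $v \in \Gamma_{G_1}X_j \cap X_i$.

iii $u \in \Gamma_{G_1}X_i \cap X_j$ and $v \in \Gamma^{-1}_{G_1}X_i \cap X_j$.

iv $u \in \Gamma^{-1}_{G_1}X_i \cap X_j$ and $v \in \Gamma_{G_1}X_i \cap X_j$.

Step 2:

Add an edge $(u,v)$ between every pair of vertices $(u,v)$ such that $u \in \Gamma^{-1}_{G_1}X_j \cap X_i$ and $v \in \Gamma^{-1}_{G_1}X_i \cap X_j$ or $u \in \Gamma_{G_1}X_j \cap X_i$ and $v \in \Gamma_{G_1}X_i \cap X_j$.

Call the graph so obtained $G_2(V_2,E_2)$.

Claim 4 $G_2$ has no 2-cycles.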

Proof of Claim 4 All the red edges must have been deleted in step 1 above because of claim 3(ii). Therefore, all the old 2-cycles in $G_1$ have been broken. The only new edges added in step 2 are between vertices which in $G_1$ could not have had any edges between them (else property (c) of claim 1 would have been violated), and even then precisely one edge is added between every such pair. □

Claim 5 $\frac{n(n+1)}{2} \ge |E_2| \ge |E_1|$.

Proof of Claim 5 Let $e_{1ij}$ and $e_{2ij}$ denote, respectively, the number of edges deleted in step 1 and the number of edges added in step 2, for any pair of sets $X_i$ and $X_j$, $1 \le i < j \le m$. Let $a$, $b$, $c$, $d$ denote, respectively, $|\Gamma^{-1}_{G_1}X_j \cap X_i|$, $|\Gamma_{G_1}X_j \cap X_i|$, $|\Gamma_{G_1}X_i \cap X_j|$, $|\Gamma^{-1}_{G_1}X_i \cap X_j|$. Then, $e_{1ij} \le ab + cd$ and $e_{2ij} = ad + bc$.

Now, if $b$ or $c$ is equal to 0, then certainly $e_{1ij} = e_{2ij} = 0$. Else, if both $b$ and $c$ are not equal to 0, then by property (e) of claim 1, which holds for $G_1$, $a > c$ and $d > b$. Therefore,

$$a(d-b) > c(d-b)$$

or $$ad + bc > cd + ab$$

which implies that $e_{2ij} > e_{1ij}$.
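
The rearrangement used in this step can be written out explicitly. The following worked block is only an illustration; the numerical values are chosen arbitrarily to satisfy $a > c$ and $d > b$ and do not come from the report.

```latex
\begin{align*}
a(d-b) > c(d-b)
  &\iff ad - ab > cd - bc \\
  &\iff ad + bc > cd + ab .
\end{align*}
% Example with a = 3, b = 1, c = 2, d = 4 (so a > c and d > b):
%   ad + bc = 12 + 2 = 14  >  cd + ab = 8 + 3 = 11.
```

The first inequality holds because $a > c$ and $d - b > 0$, so multiplying $a > c$ by the positive quantity $d - b$ preserves the direction of the inequality.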

From the above, we can conclude in any case that $e_{2ij} \ge e_{1ij}$ for any pair of sets $X_i$ and $X_j$. Therefore,

$$|E_2| = |E_1| + \sum_{i<j} (e_{2ij} - e_{1ij}) \ge |E_1|$$

and since there are no 2-cycles in $G_2$,

$$|E_2| \le \frac{n(n+1)}{2}.$$


Part  4 


Now we finally prove case 2 of the lemma.


By claim 5, $|E_1| \le \frac{n(n+1)}{2}$. Also $|E_1| = \sum_{v \in V_1} d^{in}_{G_1}(v)$, so there must exist some vertex $v \in V_1$ such that $d^{in}_{G_1}(v) \le \frac{n}{2}$. Since $n+1$ is even, and $d^{in}_{G_1}(v)$ must be an integer, we can say that $d^{in}_{G_1}(v) \le \frac{n-1}{2}$.

Let $A_k$ be the equivalence class to which $v$ belongs. Then, by claim 2, we have $d^{in}_{G_1}(v) \ge \Phi_G(A_k) - 1$. From this and the above, we can conclude that

$$\Phi_G(A_k) - 1 \le \frac{n-1}{2}$$

or, $$\Phi_G(A_k) \le \frac{n+1}{2} = \frac{|V|}{2}.$$

The lemma then follows by letting $Y = A_k$. □
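
The counting step behind Part 4 can be checked on a small example. The sketch below is not the authors' construction: it only illustrates that a digraph on $n+1$ vertices without 2-cycles (and without self-loops) has at most $n(n+1)/2$ edges, so some vertex has in-degree at most $\lfloor n/2 \rfloor$. In the proof this count is applied to $G_1$ through $|E_1| \le |E_2|$. The function names are hypothetical.

```python
def has_two_cycle(edges):
    """True if some pair u != v has both (u, v) and (v, u) as edges."""
    edge_set = set(edges)
    return any(u != v and (v, u) in edge_set for (u, v) in edge_set)

def min_in_degree(vertices, edges):
    """Smallest in-degree over all vertices of the digraph."""
    indeg = {v: 0 for v in vertices}
    for _, v in set(edges):
        indeg[v] += 1
    return min(indeg.values())

# Example with n + 1 = 4 vertices (so n = 3, and n + 1 is even) and no 2-cycles.
vertices = range(4)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]
n = len(vertices) - 1
assert not has_two_cycle(edges)
assert len(edges) <= n * (n + 1) // 2
assert min_in_degree(vertices, edges) <= (n - 1) // 2
```

With $n = 3$ the example has the maximum of 6 edges, and its minimum in-degree of 1 equals $(n-1)/2$, which matches the integer rounding used in the proof when $n+1$ is even.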

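Theorem 6 Every minimal bottleneck of any digraph $G(V,E)$ is sequentially collapsible.

Proof Suppose not. Let $Z \subseteq V$ be a minimal bottleneck set which is not sequentially collapsible. For each $v \in Z$, define

$X_v = \{x :$ There is a collapsing sequence $v_0, v_1, v_2, \ldots, v_k$, for some $0 \le k \le |Z|$, with $v_0 = v$ and $v_k = x$ and $\forall i\ 1 \le i \le k\ \ v_i \in Z\}$.

Clearly, $w \in X_v \Rightarrow X_w \subseteq X_v$. Let $X_1, X_2, \ldots, X_m$ be the maximals of the partial order defined by $\subseteq$ on the set $\bigcup_{v \in Z}\{X_v\}$, i.e., $\forall i\ 1 \le i \le m\ \ \forall j\ 1 \le j \le m\ \ (i \neq j) \Rightarrow (X_i \not\subseteq X_j)$. Then $Z = \bigcup_{i=1}^{m} X_i$ and

$\forall i\ 1 \le i \le m\ \ \Gamma^{-1}_G X_i \cap \Gamma_G X_i = \emptyset$ (else $X_i$ is not maximal.)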

Therefore,  the  subgraph  G  i  induced  by  Z  satisfies  the  antecedent  of  lemma  4.  Consequently,  for 
some  X.  l<i  <m ,  we  can  find  a  non-empty  subset  Y  of  Xi  such  that  <l>G,(y)  ^  •  But  then. 


<hG  (T)  <  C)g,(T)  + 1  Fg'ZI  <  + 1  Fg'ZI 


contradicts  the  minimality  of  Z.  □ 
MISSION


of

Rome Air Development Center


RADC plans and executes research, development, test and selected
acquisition programs in support of Command, Control, Communications
and Intelligence (C³I) activities. Technical and engineering support within
areas of competence is provided to ESD Program Offices (POs) and other
ESD elements to perform effective acquisition of C³I systems. The areas
of technical competence include communications, command and control,
battle management, information processing, surveillance sensors,
intelligence data collection and handling, solid state sciences,
electromagnetics, and propagation, and electronic, maintainability, and
compatibility.


